PPO GRPO GSPO DAPO的Loss计算与代码实现

张恒硕 發表於 2025-10-20 17:02:00

PPO GRPO GSPO DAPO的Loss计算与代码实现

首先看一下KL的基础公式
<h3 id="kl">KL</h3>
KL1:
大模型的KL一般是反向的：
<div class="math display">\[KL(\pi_\theta||\pi_{ref}) = E_{x\sim\pi_\theta(\cdot|o_{<t})}log\frac{\pi_\theta(x|o_{<t})}{\pi_{ref}(x|o_{<t})}
\]</div>\(x\sim\pi_\theta(\cdot|o_{<t})\) 代表当前模型根据前t-1个token采样得到第t个token x
KL3(GRPO使用的无偏，低方差KL1估计) http://joschu.net/blog/kl-approx.html：
<div class="math display">\[KL(\pi_\theta||\pi_{ref}) = \mathbb{E}_{x\sim\pi_\theta(\cdot|o_{<t})}\frac{\pi_{ref}}{\pi_{\theta}} - log(\frac{\pi_{ref}}{\pi_{\theta}})-1
\]</div><ul>
<li>正向KL：倾向于使模型分布 Q 覆盖目标分布 P 的所有支持点，适合于需要模型分布更广泛覆盖的情况。</li>
<li>反向KL：倾向于使模型分布 Q 集中在目标分布 P 的高概率区域，适合于生成任务，能够提高生成样本的质量和稳定性。</li>
</ul>
因此，在大语言模型和生成任务中，反向KL通常更受青睐。
<h2 id="不同rl算法-loss的计算">不同RL算法 loss的计算</h2>
对于q的第\(i\)个sample的第\(t\)个token的loss: \(loss_{i,t}=pg\_loss_{i,t}+entropy\_loss_{i, t}+kl\_loss_{i,t}\)
再对一个batch中所有的token loss \(loss_{i,t}\)做聚合agg，得到这个batch的整体loss，可用于后续的反向传播和模型更新。
<table>
<thead>
<tr>
<th>每个token的loss</th>
<th>\(pg\_loss_{i,t}\)</th>
<th>\(kl\_loss_{i,t}\)</th>
<th>lossagg mode</th>
</tr>
</thead>
<tbody>
<tr>
<td>PPO</td>
<td>\(\max(IS_{i,t}*-A_{i,t},clip(IS_{i,t})*-A_{i,t})\)</td>
<td>\(r_t=-\mathbb{D1}_{KL}(\pi_{old}||\pi_{ref})+r_t\)</td>
<td>\(\frac{1}{|o|}\sum_{t=1}^{|o|}loss_{t}\) token-mean</td>
</tr>
<tr>
<td>Dual-clip PPO</td>
<td>for A<0, \(\min(\max(IS_{i,t}*-A_{i,t},clip(IS_{i,t})*-A), clip\_c*-A)\)</td>
<td>\(r_t=-\mathbb{D1}_{KL}(\pi_{old}||\pi_{ref})+r_t\)</td>
<td>\(\frac{1}{|o|}\sum_{t=1}^{|o|}loss_{t}\) token-mean</td>
</tr>
<tr>
<td>GRPO</td>
<td>\(\max(IS_{i,t}*-A_{i,t},clip(IS_{i,t})*-A_{i,t})\)</td>
<td>\(\beta*\mathbb{D3}_{KL}(\pi_{\theta}||\pi_{ref})\)</td>
<td>\(\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}loss_{i,t}\) seq-mean-token-mean</td>
</tr>
<tr>
<td>GSPO</td>
<td>\(IS_{i,t} = sg[\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)}]*\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{sg[\pi_{\theta}(o_{i,t}|q,o_{i,<t})]}\) \(\max(IS_{i,t}*-A_{i,t},clip(IS_{i,t})*-A_{i,t})\)</td>
<td>\(\beta*\mathbb{D3}_{KL}(\pi_{\theta}||\pi_{ref})\)</td>
<td>\(\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}loss_{i,t}\) seq-mean-token-mean</td>
</tr>
<tr>
<td>DAPO</td>
<td>\(\max(IS_{i,t}*-A_{i,t},clip(IS_{i,t})*-A_{i,t})\)</td>
<td>\(\beta*\mathbb{D3}_{KL}(\pi_{\theta}||\pi_{ref})\)</td>
<td>\(\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}loss_{i,t}\) token-mean</td>
</tr>
</tbody>
</table>
<h3 id="ppo">PPO</h3>
优化目标：
<div class="math display">\
\]</div>优势： GAE 
递推公式，t步的累积优势=t步的优势+ t+1步的累积优势=t步及之后每一步的优势=t步及之后所有的奖励-第t步的预计奖励
<div class="math display">\[\begin{aligned}
A_t &= (r_t+\gamma V_{t+1}-V_t)+\gamma A_{t+1}\\
A_t &= \sum_{i=t}^T \gamma ^{i-t}(r_t+\gamma V_{t+1}-V_t)\\
A_t &= r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+...+\gamma^{T-t}r_T-V_t\\
\end{aligned}
\]</div>奖励：
<div class="math display">\[r_t=\begin{cases}-KL(\pi_{old}||\pi_{ref}), &t\neq T \\ -KL(\pi_{old}||\pi_{ref})+RM(q,o_i), &t=T \end{cases}
\]</div><code>verl/trainer/ppo/ray_trainer.py</code>verl ｜如何在奖励中添加KL惩罚项？
<pre><code class="language-python">###################################################
# 将KL惩罚loss应用到reward中。原始的reward是
# return KL(\pi_old||\pi_{ref}) + reward
###################################################
def apply_kl_penalty(data: DataProto, kl_ctrl: core_algos.AdaptiveKLController, kl_penalty="kl"):
"""Apply KL penalty to the token-level rewards.

This function computes the KL divergence between the reference policy and current policy,
then applies a penalty to the token-level rewards based on this divergence.

Args:
 data (DataProto): The data containing batched model outputs and inputs.
 kl_ctrl (core_algos.AdaptiveKLController): Controller for adaptive KL penalty.
 kl_penalty (str, optional): Type of KL penalty to apply. Defaults to "kl".

Returns:
 tuple: A tuple containing:
 - The updated data with token-level rewards adjusted by KL penalty
 - A dictionary of metrics related to the KL penalty
"""
response_mask = data.batch["response_mask"]
token_level_scores = data.batch["token_level_scores"]
batch_size = data.batch.batch_size

# compute kl between ref_policy and current policy
# When apply_kl_penalty, algorithm.use_kl_in_reward=True, so the reference model has been enabled.
kld = core_algos.kl_penalty(
 data.batch["old_log_probs"], data.batch["ref_log_prob"], kl_penalty=kl_penalty
)# (batch_size, response_length)
kld = kld * response_mask
beta = kl_ctrl.value

token_level_rewards = token_level_scores - beta * kld
</code></pre>
KL
<div class="math display">\[KL(\pi_{old}||\pi_{ref}) = log(\frac{\pi_{old}(o_t|q, o_{<t})}{\pi_{ref}(o_t|q, o_{<t})})
\]</div>PPO的KL散度是old到ref的
PPO的代码实现详见下面的Dual-clip PPO（PPO的改进版）
<h3 id="dual-clip-ppo">Dual-clip PPO</h3>
https://arxiv.org/pdf/1912.09729：对A<0的token的重要性采样IS做clip
<img src="https://p.ipic.vip/awpfmq.png" alt="image-20251020144504938" style="zoom: 50%">
论文发现当A<0时，重要性采样的比值*A可以是负无穷，这会导致训练不稳定（梯度爆炸）的现象，因此在ppo的clip上，对于A<0又进一步添加了新的clip (clip_ratio_c)。
<div class="math display">\[\mathrm{per\ token\ objection} = \begin{cases}
\min(IS*A, clip(IS, 1-\epsilon, 1+\epsilon)*A), &A\geq0\\
\max(\min(IS*A, clip(IS, 1-\epsilon, 1+\epsilon)*A), clip\_ratio\_c*A), &A<0\\
\end{cases}
\]</div>代码：
整体的ppo_loss是由pg_loss + kl_loss + entropy_loss构成，不同的RL方法pg_loss, kl_loss的计算方法是不同的。
<ul>
<li>pg_loss：具体于<code>verl/trainer/ppo/core_algos.py</code>（我将在dual-clip ppo和gspo部分介绍对应的pg_loss代码）。</li>
<li>kl_loss：同样位于<code>verl/trainer/ppo/core_algos.py</code>（我将会在grpo部分介绍具体的low_var_kl代码）。</li>
</ul>
<code>verl/verl/workers/roles/utils/losses.py</code>: ppo_loss的计算
<pre><code class="language-python">######################################################
# 此函数用于计算整体的actor loss
######################################################
def ppo_loss(config: ActorConfig, model_output, data: TensorDict, dp_group=None):
log_prob = model_output["log_probs"]
entropy = model_output.get("entropy", None)

log_prob = no_padding_2_padding(log_prob, data)# (bsz, response_length)
if entropy is not None:
 entropy = no_padding_2_padding(entropy, data)# (bsz, response_length)

metrics = {}

response_mask = data["response_mask"].to(bool)
# compute policy loss
old_log_prob = data["old_log_probs"]
advantages = data["advantages"]

loss_agg_mode = config.loss_agg_mode

loss_mode = config.policy_loss.get("loss_mode", "vanilla")

policy_loss_fn = get_policy_loss_fn(loss_mode)
# 调用下面的计算pg_loss的代码框
pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
 old_log_prob=old_log_prob,
 log_prob=log_prob,
 advantages=advantages,
 response_mask=response_mask,
 loss_agg_mode=loss_agg_mode,
 config=config,
)

metrics.update(
 {
 "pg_loss": pg_loss.detach().item(),
 "pg_clipfrac": pg_clipfrac.detach().item(),
 "ppo_kl": ppo_kl.detach().item(),
 "pg_clipfrac_lower": pg_clipfrac_lower.detach().item(),
 }
)
policy_loss = pg_loss

# 是否使用entropy loss
# add entropy loss
if entropy is not None:
 entropy_loss = agg_loss(loss_mat=entropy, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)
 entropy_coeff = config.entropy_coeff
 # token的entropy越大越好，而loss是越小越好，因此是减去 entropy
 policy_loss -= entropy_coeff * entropy_loss

# 是否使用KL loss（grpo/gspo使用，ppo/dapo不使用）
# add kl loss
if config.use_kl_loss:
 ref_log_prob = data["ref_log_prob"]
 # compute kl loss
 kld = kl_penalty(logprob=log_prob, ref_logprob=ref_log_prob, kl_penalty=config.kl_loss_type)
 kl_loss = agg_loss(loss_mat=kld, loss_mask=response_mask, loss_agg_mode=config.loss_agg_mode)

 policy_loss += kl_loss * config.kl_loss_coef
 metrics["kl_loss"] = kl_loss.detach().item()
 metrics["kl_coef"] = config.kl_loss_coef

return policy_loss, metrics
</code></pre>
<code>verl/trainer/ppo/core_algos.py</code>不同的RL方法计算pg_loss是不同的，这里的是ppo的pg_loss，后面还会介绍gspo的pg_loss的实现。
<pre><code class="language-python">######################################################
# 此函数用于计算pg_loss，并不计算KL惩罚项
######################################################
@register_policy_loss("vanilla")# type: ignore
def compute_policy_loss_vanilla(
old_log_prob: torch.Tensor,
log_prob: torch.Tensor,
advantages: torch.Tensor,
response_mask: torch.Tensor,
loss_agg_mode: str = "token-mean",
config: Optional = None,
rollout_is_weights: torch.Tensor | None = None,
) -> tuple:
"""
Compute the clipped policy objective and related metrics for PPO.

Adapted from
https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

Args:
 old_log_prob (torch.Tensor):
 Log-probabilities of actions under the old policy, shape (batch_size, response_length).
 log_prob (torch.Tensor):
 Log-probabilities of actions under the current policy, shape (batch_size, response_length).
 advantages (torch.Tensor):
 Advantage estimates for each action, shape (batch_size, response_length).
 response_mask (torch.Tensor):
 Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
 loss_agg_mode (str, optional):
 Aggregation mode for `agg_loss`. Defaults to "token-mean".
 config: `(verl.trainer.config.ActorConfig)`:
 config for the actor.
 rollout_log_probs: `(torch.Tensor)`:
 log probabilities of actions under the rollout policy, shape (batch_size, response_length).
"""

assert config is not None
assert not isinstance(config, AlgoConfig)
clip_ratio = config.clip_ratio# Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347.
clip_ratio_low = config.clip_ratio_low if config.clip_ratio_low is not None else clip_ratio
clip_ratio_high = config.clip_ratio_high if config.clip_ratio_high is not None else clip_ratio
clip_ratio_c = config.get(# Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
 "clip_ratio_c", 3.0
)

cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high

assert clip_ratio_c > 1.0, (
 "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
 + f" but get the value: {clip_ratio_c}."
)

# 计算每一个token的重要性采样的比值的log
# log(\pi_{\theta}(o_{i,t}|q,o_{i,<t}))-log(\pi_{old}(o_{i,t}|q,o_{i<t}))
negative_approx_kl = log_prob - old_log_prob
# 对IS的log做clip，避免过大或过小
# Clamp negative_approx_kl for stability
negative_approx_kl = torch.clamp(negative_approx_kl, min=-20.0, max=20.0)
# 这里ratio是真正的IS 重要性采样
ratio = torch.exp(negative_approx_kl)
# 计算出-IS在token-level上的均值
ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

######################################################
# 下面开始计算pg_loss=
#A>0, max(ratio*-A, clip(ratio, 1-\epsilon_low, 1+\epsilon_high)*-A)
#A<0, min(max(ratio*-A, clip(ratio, 1-\epsilon_low, 1+\epsilon_high)*-A), clip_ratio_c*-A)
######################################################
pg_losses1 = -advantages * ratio
if cliprange_low is None:
 cliprange_low = cliprange
if cliprange_high is None:
 cliprange_high = cliprange
# clip后的loss
pg_losses2 = -advantages * torch.clamp(
 ratio, 1 - cliprange_low, 1 + cliprange_high
)# - clip(ratio, 1-cliprange, 1+cliprange) * A
# ppo per token loss
clip_pg_losses1 = torch.maximum(
 pg_losses1, pg_losses2
)# max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
# 计算被才剪掉的token在这个batch的所有未mask的token的比例（axis=None）【常数】
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

# 这里是dual-clip PPO提出，使用clip_ratio_c限制A<0的token的loss
pg_losses3 = -advantages * clip_ratio_c
# min(max(ratio*-A, clip(ratio, 1-\epsilon_low, 1+\epsilon_high)*-A), clip_ratio_c*-A)
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
# 记录在传统ppo下，进一步裁减的A<0的IS大于clip_ratio_c的token在这个batch的所有未mask的token的比例【常数】
pg_clipfrac_lower = verl_F.masked_mean(
 torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)

# pg_losses是分段函数(记录每个token的loss)，A<0时用clip_pg_losses2, A>=0时用clip_pg_losses1
pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
# pg_losses: (bsz, response_length)
# 如何计算一整个batch的所有token的整体loss。这有多种方式，主要看配置的loss_agg_mode
pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)

return pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower

</code></pre>
咱们继续看几种token loss的agg mode。不同RL方法，loss agg mode也是不同的
<code>verl/trainer/ppo/core_algos.py</code>
<pre><code class="language-python">def agg_loss(loss_mat: torch.Tensor, loss_mask: torch.Tensor, loss_agg_mode: str):
"""
Aggregate the loss matrix into a scalar.

Args:
 loss_mat: `(torch.Tensor)`:
 shape: (bs, response_length)
 loss_mask: `(torch.Tensor)`:
 shape: (bs, response_length)
 loss_agg_mode: (str) choices:
 method to aggregate the loss matrix into a scalar.
Returns:
 loss: `a scalar torch.Tensor`
 aggregated loss
"""
if loss_agg_mode == "token-mean":
 loss = verl_F.masked_mean(loss_mat, loss_mask)
elif loss_agg_mode == "seq-mean-token-sum":
 seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)# token-sum
 loss = torch.mean(seq_losses)# seq-mean
elif loss_agg_mode == "seq-mean-token-mean":
 seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1)# token-mean
 loss = torch.mean(seq_losses)# seq-mean
elif loss_agg_mode == "seq-mean-token-sum-norm":
 seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)
 loss = torch.sum(seq_losses) / loss_mask.shape[-1]# The divisor
 # (loss_mask.shape[-1]) should ideally be constant
 # throughout training to well-replicate the DrGRPO paper.
 # TODO: Perhaps add user-defined normalizer argument to
 # agg_loss to ensure divisor stays constant throughout.
else:
 raise ValueError(f"Invalid loss_agg_mode: {loss_agg_mode}")

return loss
</code></pre>
<h3 id="grpo">GRPO</h3>
优化目标：
<div class="math display">\-\beta \mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref})\}
\]</div>优势：
<div class="math display">\[A_{i,t} = \frac{r_i-mean(r)}{std(r)}
\]</div>KL3
<div class="math display">\[\mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref}) =\frac{\pi_{ref}(o_{i, t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q, o_{i, <t})}-log(\frac{\pi_{ref}(o_{i, t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q, o_{i, <t})})-1
\]</div>KL3的方差比KL1小，且是KL1的无偏估计
证明
<div class="math display">\[\begin{aligned}
\mathbb{D3}_{KL}(P||Q) &= \sum_{x\sim_{P}}P(x) [\frac{Q(x)}{P(x)} - log(\frac{P(x)}{Q(x)})-1]\\
&= \sum_{x\sim P}Q(x)+P(x)log(\frac{P(x)}{Q(x)})-P(x)\\
&=\sum_{x\sim P}Q(x) -\sum_{x\sim P}P(x)+\mathbb{D1}_{KL}(P||Q) \\
&=\mathbb{D1}_{KL}(P||Q)+\sum_{x\sim P}Q(x)-1\ \ \ \ \ \ \ \ \当P所有采样在Q中的概率和为1时（vocab一样的话）\\
&=\mathbb{D_1}_{KL}(P||Q)
\end{aligned}
\]</div><code>verl/trainer/ppo/core_algos.py</code> 下面是verl对kl_loss的实现：
<pre><code class="language-python">def kl_penalty_forward(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> torch.FloatTensor:
"""Compute KL divergence given logprob and ref_logprob.
Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104
See more description in http://joschu.net/blog/kl-approx.html

Args:
 logprob:
 ref_logprob:

Returns:
 kl_estimate
"""
if kl_penalty in ("kl", "k1"):
 return logprob - ref_logprob

if kl_penalty == "abs":
 return (logprob - ref_logprob).abs()

if kl_penalty in ("mse", "k2"):
 return 0.5 * (logprob - ref_logprob).square()

##############################################################
# 这里的low_var_kl与上述的grpo的KL计算公式相同
##############################################################
# J. Schulman. Approximating kl divergence, 2020.
# # URL http://joschu.net/blog/kl-approx.html.
if kl_penalty in ("low_var_kl", "k3"):
 kl = ref_logprob - logprob
 # For numerical stability
 kl = torch.clamp(kl, min=-20, max=20)
 ratio = torch.exp(kl)
 kld = (ratio - kl - 1).contiguous()
 return torch.clamp(kld, min=-10, max=10)

if kl_penalty == "full":
 # so, here logprob and ref_logprob should contain the logits for every token in vocabulary
 raise NotImplementedError

raise NotImplementedError
</code></pre>
<h3 id="gspo">GSPO</h3>
seq-level 优化目标：
<div class="math display">\
\]</div><div class="math display">\[\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)} = \frac{\Pi_{t=1}^{|o_i|} \pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\Pi_{t=1}^{|o_i|} \pi_{old}(o_{i,t}|q, o_{i,<t})}
\]</div>token-level 优化目标：
<div class="math display">\[J = \mathbb{E}_{\{o_i\}_{i=1}^G\sim \pi_{old}(\cdot|q)}\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min(s_{i,t}A_{i,t}, clip(s_{i,t}, 1-\epsilon,1+\epsilon)A_{i,t})\\
\hat{s}_{i,t} = sg[(\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)})^{\frac{1}{|o_i|}}]* \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{sg[\pi_{\theta}(o_{i,t}|q,o_{i,<t})]}
\]</div>可以发现的是 \(sg=sg,s_{i}=(\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)})^{\frac{1}{|o_i|}}\)，但是在方向上不同
通过证明，可以发现，当\(A_{i,t}=A_i\)时，seq-level和token-level在前向传播和反向传播上是一样的 
token-level 可以更好地扩展同sample不同token的A的灵活度（每个token的A可以不相同）
<code>verl/trainer/ppo/core_algos.py</code>
<pre><code class="language-python">##########################################################
# 计算gspo的pg_loss,重点关注IS的计算
##########################################################
@register_policy_loss("gspo")
def compute_policy_loss_gspo(
old_log_prob: torch.Tensor,
log_prob: torch.Tensor,
advantages: torch.Tensor,
response_mask: torch.Tensor,
loss_agg_mode: str = "seq-mean-token-mean",
config: Optional = None,
rollout_is_weights: torch.Tensor | None = None,
) -> tuple:
"""
Compute the clipped policy objective and related metrics for GSPO.

See https://arxiv.org/pdf/2507.18071 for more details.

Args:
 old_log_prob (torch.Tensor):
 Log-probabilities of actions under the old policy, shape (batch_size, response_length).
 log_prob (torch.Tensor):
 Log-probabilities of actions under the current policy, shape (batch_size, response_length).
 advantages (torch.Tensor):
 Advantage estimates for each action, shape (batch_size, response_length).
 response_mask (torch.Tensor):
 Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
 loss_agg_mode (str, optional):
 Aggregation mode for `agg_loss`. For GSPO, it is recommended to use "seq-mean-token-mean".
"""

assert config is not None
assert isinstance(config, ActorConfig)
clip_ratio_low = config.clip_ratio_low if config.clip_ratio_low is not None else config.clip_ratio
clip_ratio_high = config.clip_ratio_high if config.clip_ratio_high is not None else config.clip_ratio

negative_approx_kl = log_prob - old_log_prob

# compute sequence-level importance ratio:
# si(θ) = (π_θ(yi|x)/π_θold(yi|x))^(1/|yi|) =
# exp [(1/|y_i|) * Σ_t log(π_θ(y_i,t|x,y_i,<t)/π_θold(y_i,t|x,y_i,<t))]
seq_lengths = torch.sum(response_mask, dim=-1).clamp(min=1)
negative_approx_kl_seq = torch.sum(negative_approx_kl * response_mask, dim=-1) / seq_lengths

# Combined ratio at token level:
# s_i,t(θ) = sg · π_θ(y_i,t|x, y_i,<t) / sg[π_θ(y_i,t|x, y_i,<t)]
# In log space: log(s_i,t(θ)) = sg + log_prob - sg
log_seq_importance_ratio = log_prob - log_prob.detach() + negative_approx_kl_seq.detach().unsqueeze(-1)
log_seq_importance_ratio = torch.clamp(log_seq_importance_ratio, max=10.0)# clamp for numerical stability

# finaly exp() to remove log
seq_importance_ratio = torch.exp(log_seq_importance_ratio)

pg_losses1 = -advantages * seq_importance_ratio
pg_losses2 = -advantages * torch.clamp(seq_importance_ratio, 1 - clip_ratio_low, 1 + clip_ratio_high)
pg_losses = torch.maximum(pg_losses1, pg_losses2)

# Apply rollout importance sampling weights if provided
if rollout_is_weights is not None:
 pg_losses = pg_losses * rollout_is_weights

# for GSPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean")

# For compatibility, return zero for pg_clipfrac_lower (not used in standard GSPO)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)
pg_clipfrac_lower = torch.tensor(0.0, device=pg_loss.device)

ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

return pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower

</code></pre>
<h3 id="dapo">DAPO</h3>
优化目标：
<div class="math display">\[\mathcal{J} = \mathbb{E}_{(q,a)\sim \mathcal{D}, \{o_i\}_{i=1}^G\sim \pi_{old}(\cdot|q)} [\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\min(r_{i,t}(\theta)A_{i, t}, clip(r_{i,t}(\theta),1-\epsilon_{low}, 1+\epsilon_{high})A_{i,t})]\\
s.t.\ 0<|\{o_i|is\_equivalent(o_i,a)\}|<G
\]</div>其中
<div class="math display">\[r_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{old}(o_{i,t}|q,o_{i,<t})}, A_{i,t} = \frac{R_i-mean(\{R_i\}_{i=1}^G)}{std(\{R_i\}_{i=1}^G)}
\]</div>其loss agg mode是token-mean。
<h2 id="关于不同rl对kl实现的絮叨">关于不同RL对KL实现的絮叨</h2>
首先，我们要明白的是原始的PPO算法并不是针对LLM设计的，而是gpt-3.5 instruct将PPO算法应用到post-training的RLHF中，原始的PPO的KL loss如下（摘自PPO论文 Proximal Policy Optimization Algorithms）
<img src="https://p.ipic.vip/fb1lsp.png" alt="image-20251024204512938" style="zoom: 33%">
可以发现，这里是添加了\(\pi_{old}\)和\(\pi_{\theta}\)的KL散度，并没有提到\(\pi_{ref}\)，这是因为原始PPO算法只是避免\(\pi_{\theta}\)偏离\(\pi_{old}\)。而在LLM的场景下，为了避免在灾难性遗忘，才特意添加了\(\pi_{ref}\)，避免与原始模型分布偏差过多。
那具体要添加到哪里呢，其实可以发现的是将PPO应用到LLM中时，RM奖励模型只能给出最后一个token的奖励（ORM），并不是传统PPO那种每个token都有奖励（PRM），为了保证在LLM场景下也是密集型奖励，在代码实现时会将KL惩罚项赋值给奖励。GRPO论文提到了PPO在RLHF实现时对每个token的奖励，添加了\(KL(\pi_{\theta}||\pi_{ref})\)。
<img src="https://p.ipic.vip/3snfis.png" alt="image-20251024205511276" style="zoom: 50%">
但是，在具体的<code>verl/trainer/ppo/ray_trainer.py</code>代码实现中（如下），我发现并不是\(KL(\pi_\theta||\pi_{ref})\)而是\(KL(\pi_{old}||\pi_{ref})\)，这是为什么呢？
<pre><code class="language-python">kld = core_algos.kl_penalty(
 data.batch["old_log_probs"], data.batch["ref_log_prob"], kl_penalty=kl_penalty
)# (batch_size, response_length)
</code></pre>
我认为是这样，在代码实现中，首先是rollout并记录experience（old_log, ref_log），然后比较rollout的结果和ground truth得到每个token的reward以及advantage，最后\(\pi_\theta\)再根据experience以及advantage更新模型参数。
因此，为了方便在更新\(\pi_{\theta}\)之前就准备好reward，verl在计算每个token的KL时便使用experience中的old_log代替了在后续训练才能得到的log（在\(\pi_\theta\)上得到的）。
咱们可以进一步地思考一下GRPO算法中的KL惩罚项，GRPO提出组间相对优势，并将ORM的稀疏奖励赋值到所有的token上，并将PPO的reward中的与\(\pi_{ref}\)的惩罚项提出来，专门为每个token添加一个优化项\(\mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref})\)
<hr>
我们进一步分析RFT、PPO和GRPO的优化梯度
RFT：注意，这里的\(\mathbb{I}(o)\)是用来rejected的，将答案错误的\(o\)mask掉，不做训练（梯度更新） 
这里的loss函数\(\mathcal{J}(\theta)\)其实就是每个句子\(o\)的\(log(PPL(o))\) （o的困惑度的log） \(PPL(o) = \frac{1}{(\Pi_{t=1}^{|o|}\pi_{\theta(o_t|q,o_{<t})})^{\frac{1}{|o|}}}\)
<img src="https://p.ipic.vip/1kyj3m.png" alt="image-20251024213901333" loading="lazy">
<hr>
Online RFT（区别在于online RFT的\(o\) 是从当前策略\(\theta\)采样出来的，是on-policy；而RFT的\(o\)是从其他模型采样出来的，是off-policy）
<img src="https://p.ipic.vip/99kuvz.png" alt="image-20251024213926893" loading="lazy">
<hr>
PPO 注意，这里认为\(\pi_{old}=\pi_{\theta}\)，从而得到公式（17）,可以发现，PPO的gradient与online RFT比较类似。 
对于每个token，添加了一个\(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{old}(o_t|q,o_{<t})}A_t\)的系数。
<img src="https://p.ipic.vip/g2ixs9.png" alt="image-20251024214422133" loading="lazy">
<hr>
GRPO 可以发现GRPO的梯度的系数比PPO多一个\(\beta(\frac{\pi_{\theta}}{\pi_{old}}-1)\)，这其实就是\(\mathbb{D3}_{KL}(\pi_{\theta}, \pi_{ref})\)对于\(\theta\)的导数。
<img src="https://p.ipic.vip/gs7xuy.png" alt="image-20251024215607143" loading="lazy">
<img src="https://p.ipic.vip/ok0h61.png" alt="image-20251024215618020" loading="lazy">
以上梯度分析参考于GRPO论文：DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models 
来源：https://www.cnblogs.com/qlhh/p/19153109

頁: [1]

圆梦公社's Archiver

PPO GRPO GSPO DAPO的Loss计算与代码实现