SPIN: Analysis of the Original Paper

Document Summary

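An analysis of the original SPIN paper. For the loss in the original SPO paper, computing $\widehat{P}(y \succ \pi_t \mid \mathbf{x})$ is relatively inefficient, so the loss from Kimi K1.5 can be used instead. The Kimi K1.5 paper describes the estimate as follows: "To approximate $\tau \log Z$, we use samples $(y_1, z_1), \ldots, (y_k, z_k) \sim \pi_{\theta_i}$." It adds: "We also find that using empirical mean of sampled rewards yields effective practical results."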
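The two baselines mentioned in the summary can be compared with a minimal sketch. It assumes the sample-based estimate takes the log-mean-exp form $\tau \log \frac{1}{k} \sum_{j=1}^{k} \exp(r_j / \tau)$, where $r_j$ is the reward of the $j$-th sample; the summary does not spell this formula out, so treat it as an assumption. The function names and the reward values are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def approx_tau_log_z(rewards: Sequence[float], tau: float) -> float:
    """Estimate tau * log Z from k sampled rewards (assumed log-mean-exp form)."""
    if not rewards:
        raise ValueError("need at least one sampled reward")
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [r / tau for r in rewards]
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return tau * log_mean_exp


def mean_reward_baseline(rewards: Sequence[float]) -> float:
    """Empirical mean of the sampled rewards, the simpler alternative baseline."""
    if not rewards:
        raise ValueError("need at least one sampled reward")
    return sum(rewards) / len(rewards)


# Hypothetical rewards for k = 4 samples drawn for one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
tau = 0.5

soft_baseline = approx_tau_log_z(rewards, tau)
mean_baseline = mean_reward_baseline(rewards)
print(f"tau*log Z estimate: {soft_baseline:.4f}, mean reward: {mean_baseline:.4f}")
```

By Jensen's inequality the log-mean-exp estimate is never below the empirical mean and never above the largest sampled reward, and the two coincide when all sampled rewards are equal. The empirical mean avoids the exponentials entirely, which is one plausible reason it is attractive in practice.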