廷秀 發表於 2025-8-4 17:36:00

学习理论:代理损失函数的泛化界与Rademacher复杂度

<p>3月份以来的科研基本围绕推导代理损失函数的一致界展开,证明现已基本完工(关于代理损失函数的一致界的介绍可参见之前的博客《学习理论:预测器-拒绝器多分类弃权学习》)。导师建议我可以的话把泛化界也一起证了╰( ̄▽ ̄)╭。由于认识有搞过泛化的朋友,许多东西尽管没有亲自上手做但早已“耳濡目染”了,比如集中不等式这些。这篇博客旨在就代理损失函数的泛化界证明流程进行初步的介绍,并穿插介绍一些学习理论、高维概率中常用的概念。</p>
<h1 id="1-导引">1 导引</h1>
<p>首先,回顾一下多分类弃权学习的基本设置。假设带标签样本集<span class="math inline">\(S=((x_1, y_1), \cdots, (x_m, y_m))\)</span>中的每个样本<span class="math inline">\((x_i, y_i)\)</span>都独立同分布地采自<span class="math inline">\(\mathcal{D}\)</span>,记<span class="math inline">\(S\sim \mathcal{D}^m\)</span>,且有<span class="math inline">\((x_i, y_i)\in \mathcal{X}\times \mathcal{Y}\)</span>(标签<span class="math inline">\(\mathcal{Y} = \{1, \cdots, n\}\)</span>(<span class="math inline">\(n\geqslant 2\)</span>))。</p>
<p>针对多分类问题的预测器-拒绝器弃权损失<sup></sup><span class="math inline">\(L_{\text{abst}}\)</span>定义如下(这里默认采用单阶段的训练设置):</p>
<p></p><div class="math display">\[L_{\text{abst}}(h, r, x, y) = \underbrace{\mathbb{I}_{\text{h}(x) \neq y}\mathbb{I}_{r(x) &gt; 0}}_{\text{不弃权}}+ \underbrace{c \mathbb{I}_{r(x)\leqslant 0}}_{\text{弃权}}
\]</div><p></p><p>其中<span class="math inline">\((h, r)\in \mathcal{H}\times\mathcal{R}\)</span>为预测器-拒绝器对(<span class="math inline">\(\mathcal{H}\)</span>和<span class="math inline">\(\mathcal{R}\)</span>为两个从<span class="math inline">\(\mathcal{X}\)</span>到<span class="math inline">\(\mathbb{R}\)</span>的函数构成的函数族),<span class="math inline">\(\text{h}(x) = \underset{y\in \mathcal{Y}}{\text{arg max}}\space {h(x)}_y\)</span>直接输出实例<span class="math inline">\(x\)</span>的预测标签。假设<span class="math inline">\(c\in (0, 1)\)</span>为一个常数。</p>
<p>在之前的博客中我们提到过,设<span class="math inline">\(\mathcal{l}\)</span>为在标签<span class="math inline">\(\mathcal{Y}\)</span>上定义的0-1多分类损失的代理损失,则我们可以在此基础上进一步定义弃权代理损失<span class="math inline">\(L\)</span>:</p>
<p></p><div class="math display">\[L(h, r, x, y) = \mathcal{l}(h, x, y)\phi(-\alpha r(x)) + c \phi(\beta r(x))
\]</div><p></p><p>不过这里为了进一步简化分析,我们规定这里的拒绝器<span class="math inline">\(r\)</span>不再是学出来的,而是直接根据下列公式计算而得<sup></sup>:</p>
<p></p><div class="math display">\_y - (1 - c)
\]</div><p></p><p>这里<span class="math inline">\(\Psi: ^n\rightarrow \mathbb{R^n}\)</span>定义为将数据的条件概率分布向量<span class="math inline">\(\left(\begin{matrix}
    p(Y=1\mid x)\\
    \vdots\\
    p(Y=n\mid x)
\end{matrix}\right)\in ^n\)</span>映射到最优预测器输出向量<span class="math inline">\(\left(\begin{matrix}
    h(x)_1\\
    \vdots\\
    h(x)_n
\end{matrix}\right)\in \mathbb{R}^n\)</span>的函数。那么<span class="math inline">\(\Psi^{-1}\)</span>就反之,根据最优预测器输出向量给出数据条件概率分布向量的预测值。将<span class="math inline">\(r(x)\)</span>与贝叶斯最优拒绝器<span class="math inline">\(r^*(x) = \max_{y\in \mathcal{Y}}p(y\mid x) - (1 - c)\)</span>的公式进行对比可以更好地对其进行理解)。</p>
<p>于是,弃权代理损失函数可以直接等价于在标签<span class="math inline">\(\mathcal{Y}\)</span>上定义的0-1多分类损失的代理损失<span class="math inline">\(\mathcal{l}\)</span>,下面我们以<strong>一对多(one-versus-all, OVA)</strong> 损失为例进行分析:</p>
<p></p><div class="math display">\[L(h, x, y) = \mathcal{l}(h, x, y) = \phi(h(x)_y) + \sum_{y^{\prime}\neq y}\phi(-h(x)_{y^{\prime}})
\]</div><p></p><p>其中<span class="math inline">\(\phi(\cdot)\)</span>为间隔损失(margin loss)(做为<span class="math inline">\(z \mapsto \mathbb{I}_{z \leqslant 0}\)</span>的上界),可以设置为<span class="math inline">\(\phi(z) = \exp(-z)\)</span>、<span class="math inline">\(\phi(z) = \log(1 + \exp(-z))\)</span>等等。对于一对多损失而言,<span class="math inline">\(\Psi^{-1}\)</span>的计算公式为<span class="math inline">\(\left[\Psi^{-1}(h(x))\right]_y = \frac{\phi^{\prime}(-h(x))_y}{\phi^{\prime}(-h(x))_y + \phi^{\prime}(h(x))_y}\)</span>。</p>
<p>对于目标损失<span class="math inline">\(L_{\text{abst}}\)</span>和代理损失<span class="math inline">\(L\)</span>而言,可分别定义<span class="math inline">\(L_{\text{abst}}\)</span>-期望弃权损失<span class="math inline">\(R_{L_{\text{abst}}}(h, r)\)</span>(也即目标损失函数的泛化误差)和<span class="math inline">\(L\)</span>-期望弃权代理损失<span class="math inline">\(R_{L}(h, r)\)</span>(也即代理损失函数的泛化误差)如下:</p>
<p></p><div class="math display">\, \quad R_{L}(h) = \mathbb{E}_{(x, y)\sim \mathcal{D}}\left
\]</div><p></p><p><span class="math inline">\(R_{L_{\text{abst}}}(h, r)\)</span>和<span class="math inline">\(R_{L}(h, r)\)</span>在所有可测函数构成的集合上取得的下确界分别记为:</p>
<p></p><div class="math display">\[R^*_{L_\text{abst}} = \inf_{h, r}R_{L_{\text{abst}}}(h, r), \quad R_{L}^* = \inf_{h}R_{L}(h)
\]</div><p></p><p>这被称为<strong>贝叶斯最优误差(Bayes-optimal error)</strong>,记<span class="math inline">\(h^* = \text{arg min}_{h}R_{L}(h)\)</span>。</p>
<p><strong>贝叶斯一致性(Bayes-consistency)</strong> <sup></sup>定义这样一种损失函数,这种损失函数能够确保风险最小化的预测器能够成为贝叶斯最优分类器。进一步地,代理损失函数的贝叶斯一致性是指随着训练数据规模<span class="math inline">\(m\rightarrow \infty\)</span>,基于训练样本学到的一系列输出函数<span class="math inline">\(\hat{h}_1, \hat{h}_2, \cdots, \hat{h}_m\)</span>满足下列关系:</p>
<p></p><div class="math display">\[R_L(\hat{h}_m) \xrightarrow{m\rightarrow \infty} R_L^{*} \quad \Longrightarrow \quad R_{L_{\text{abst}}}(\hat{h}, r) \xrightarrow{m\rightarrow \infty} R_{L_{\text{abst}}}^{*}
\]</div><p></p><p>从上式中可以看到,想要刻画代理损失函数的贝叶斯一致性,就需要界定<span class="math inline">\(R_L(\hat{h}) - R_L^{*}\)</span>。这项差距被称为<strong>超额误差(excess error)</strong>,它可以进一步做如下分解<sup></sup>:</p>
<p></p><div class="math display">\[R_L(\hat{h}) - R_L^{*} = \underbrace{\left(R_L(\hat{h}) - R_L^*(\mathcal{H})\right)}_{\text{estimation error}\geqslant 0} + \underbrace{\left(R_L^*(\mathcal{H}) - R_L^{*}\right)}_{\text{approximation error}\geqslant 0}
\]</div><p></p><p>其中<span class="math inline">\(R_L^*(\mathcal{H}) = \inf_{\mathcal{h\in H}}R_L(h)\)</span>为假设类最优(best-in-class)误差,对应的假设类最优假设记为<span class="math inline">\(h^{\circ}\)</span>,假设类最优误差也可以记为<span class="math inline">\(R_L(h^{\circ})\)</span></p>
<ul>
<li>
<p>第一项差值被称为<strong>估计误差(estimation error)</strong>,它衡量了一个模型和假设类最优的模型之间的差距。</p>
</li>
<li>
<p>第二项差值是<strong>逼近误差(approximation error)</strong>,代表了假设类<span class="math inline">\(\mathcal{H}\)</span>在多大程度上涵盖了贝叶斯最优分类器<span class="math inline">\(h^*\)</span>,刻画了模型的表达能力。</p>
</li>
</ul>
<p>若假设假设类和拒绝函数类为所有可测函数构成的集合,即<span class="math inline">\(\mathcal{H} = \mathcal{H}_{\text{all}}, \mathcal{R} = \mathcal{R}_{\text{all}}\)</span>,则上式中第二项逼近误差等于0。若证明以下代理损失函数的<strong>超额误差界(excess error bound)</strong> 成立,则可以直接通过取极限操作得到贝叶斯一致性:</p>
<p></p><div class="math display">\[\underbrace{R_{L_{\text{abst}}}(h, r) - R_{L_{\text{abst}}}^*}_{\text{target excess error}} \leqslant \Gamma \underbrace{\left(R_L(h) - R_L^*\right)}_{\text{surrogate excess error}}
\]</div><p></p><p>其中<span class="math inline">\(R_L^* = \inf_{\mathcal{h}}R_L(h) = R_L^{*}\)</span>。<span class="math inline">\(\Gamma: \mathbb{R}_{+}\rightarrow \mathbb{R}_{+}\)</span>为满足属性<span class="math inline">\(\lim_{t\rightarrow 0^{+}}\Gamma (t) = 0\)</span>的非递减函数。这也是我们在博客《学习理论:预测器-拒绝器多分类弃权学习》)中提到过的内容。在这篇博客中,让我们把注意力转移一个新的方向——代理损失函数的<strong>泛化误差界(generalization error gap)</strong>。</p>
<h1 id="2-泛化误差界">2 泛化误差界</h1>
<p>机器学习理论的一大目标就是最小化估计误差<span class="math inline">\(R_L(\hat{h}) - R_L^{*}(\mathcal{H})\)</span>,然而<span class="math inline">\(\hat{h}\)</span>做为<span class="math inline">\(h^*\)</span>的估计值,是最小化如下所示的经验误差(泛化误差的近似)<sup></sup>来获得的:</p>
<p></p><div class="math display">\[\widehat{R}_L(h) = \mathbb{E}_{(x, y)\sim S}\left = \frac{1}{m}\sum_{i=1}^m L(h, x_i, y_i)
\]</div><p></p><p>其中<span class="math inline">\((x, y)\sim S\)</span>表示<span class="math inline">\((x, y)\)</span>采自样本集<span class="math inline">\(S\)</span>定义的经验分布。这也就是说<span class="math inline">\(\hat{h} = \inf_{h\in \mathcal{H}}\widehat{R}_{L}(h)\)</span>。这也就意味着<span class="math inline">\(\hat{h}\)</span>并不能保证最小化泛化误差<span class="math inline">\(R_L(\hat{h})\)</span>,自然也就无法保证最小化估计误差了。</p>
<p>于是,考虑对估计误差进行进一步的分解<sup></sup>:</p>
<p></p><div class="math display">\[\begin{aligned}
    &amp; R_L(\hat{h}) - R_L^{*}(\mathcal{H}) \\
    &amp;= R_L(\hat{h}) - R_L(h^{\circ}) \quad (h^{\circ} = \text{arg min}_{\mathcal{h\in H}}R_L(h))\\
    &amp;= \underbrace{\left(R_L(\hat{h}) - \widehat{R}_L(\hat{h})\right)}_{\text{generalization gap}\geqslant 0} + \underbrace{\left(\widehat{R}_L(\hat{h}) - \widehat{R}_L(h^{\circ})\right)}_{\text{training error} \leqslant 0 \text{ or } \geqslant 0} + \underbrace{\left(\widehat{R}_L(h^{\circ}) - R_L(h^{\circ})\right)}_{\text{small term} \geqslant 0}
\end{aligned}
\]</div><p></p><ul>
<li>
<p>第一项差值<span class="math inline">\(R_L(\hat{h}) - \widehat{R}_L(\hat{h})\)</span>被称为<strong>泛化差距(generalization gap)</strong>,刻画了学习算法输出假设的泛化能力。证明学习算法的<strong>泛化界(generalization bound)</strong> 即为证明对任意<span class="math inline">\(h\in \mathcal{H}\)</span>,<span class="math inline">\(R_L(h) - \widehat{R}_L(h)\)</span>依概率被某个大致形如<span class="math inline">\(\epsilon = \mathcal{O}(\frac{\mathcal{C}(\mathcal{H})}{m})\)</span>的项给界定,其中<span class="math inline">\(\mathcal{C}(\mathcal{H})\)</span>为假设空间<span class="math inline">\(\mathcal{H}\)</span>的模型复杂度,<span class="math inline">\(m\)</span>为样本个数。按照经典统计学习理论,一般假设空间<span class="math inline">\(\mathcal{H}\)</span>的模型复杂度越低,样本个数<span class="math inline">\(m\)</span>越多,学习算法的泛化性能越好。</p>
<blockquote>
<p><strong>注</strong> 上述理论在深度学习中不一定适用。在深度学习理论中会出现所谓的双下降(double descent),过参数化模型的泛化性能也可能很好,感兴趣的同学可以查看B站南大大佬的视频JOJO想:深度学习中的泛化理论 (9)良性过拟合<sup></sup>。</p>
</blockquote>
</li>
<li>
<p>第二项差值<span class="math inline">\(\underbrace{\widehat{R}_L(\hat{h}) - \widehat{R}_L(h^{\circ})}_{\leqslant 0 \text{ or } \geqslant 0} \leqslant \underbrace{\widehat{R}_L(\hat{h}) - \inf_{h}\widehat{R}_L(h)}_{\geqslant 0}\)</span>和模型的训练误差有关,刻画了训练算法的优化能力。当<span class="math inline">\(\hat{h}\)</span>能够最小化训练误差,也即<span class="math inline">\(\widehat{R}_L(\hat{h}) = \inf_{h}\widehat{R}_L(h)\)</span>时,这一项一定<span class="math inline">\(\leqslant 0\)</span>。</p>
</li>
<li>
<p>第三项差值<span class="math inline">\(\widehat{R}(h^{\circ}) - R(h^{\circ})\)</span>一般是一个很小的项,我们称之为“小项”,也可以被界定。</p>
</li>
</ul>
<p>观察上述分解,可以发现泛化差距<span class="math inline">\(R_L(\hat{h}) - \widehat{R}_L(\hat{h})\)</span>和“小项”<span class="math inline">\(\widehat{R}(h^{\circ}) - R(h^{\circ})\)</span>具有相似的形式,都可以写成<span class="math inline">\(R_L(\cdot)\)</span>和<span class="math inline">\(\widehat{R}(\cdot)\)</span>之间关于某个假设<span class="math inline">\(h\)</span>的差值。而这个差值可以被界定如下(以泛化差距为例,“小项”同理):</p>
<p></p><div class="math display">\[\begin{aligned}
    R_L(\hat{h}) - \widehat{R}_L(\hat{h}) \leqslant \sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert
\end{aligned}
\]</div><p></p><p>若能证明当<span class="math inline">\(n\rightarrow \infty\)</span>时,<span class="math inline">\(\sup_{h\in \mathcal{H}}\lvert \widehat{R}(h)- R_L(h)\rvert\rightarrow 0\)</span>(比如该上确界被某个大致形如<span class="math inline">\(\epsilon = \mathcal{O}(\frac{\mathcal{C}(\mathcal{H})}{\sqrt{m}})\)</span>或<span class="math inline">\(\mathcal{O}(\frac{\mathcal{C}(\mathcal{H})}{m})\)</span>的项给控制),则我们称假设类<span class="math inline">\(\mathcal{H}\)</span>是<strong>一致收敛(uniform convergence)</strong> 的<sup></sup>。</p>
<blockquote>
<p><strong>注</strong> 注意这里一致收敛的“一致”对应的英文是“uniform”,而我们之前提到过的损失函数一致性的“一致性”对应的英文是“consistency”。</p>
</blockquote>
<p><span class="math inline">\(\sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert\)</span>可以视为假设<span class="math inline">\(h\in \mathcal{H}\)</span>做为下标索引的随机过程的量纲(magnitude),这种随机过程称为经验过程<sup></sup>。</p>
<p><strong>定义 1</strong> (经验过程)假设类<span class="math inline">\(\mathcal{H}\)</span>上的<strong>经验过程(empirical process)</strong><span class="math inline">\(\left(X_h\right)_{h\in \mathcal{H}}\)</span>定义为:</p>
<p></p><div class="math display">\[X_{h} := \widehat{R}_L(h) - R_L(h)
\]</div><p></p><p>证明泛化界的问题可转化为证明经验过程有界(比如<span class="math inline">\(\sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert &lt; \epsilon = \mathcal{O}(\frac{\mathcal{C}(\mathcal{H})}{\sqrt{m}})\)</span>)的问题。</p>
<h1 id="3-经验过程的rademacher复杂度上界">3 经验过程的Rademacher复杂度上界</h1>
<p>对于一对多代理损失<span class="math inline">\(L(h, x, y) = \phi\left(h(x)_y\right) + \sum_{y^{\prime}\neq y}\phi\left(-h(x)_{y^{\prime}}\right)\)</span>,以及定义在其上的经验过程<span class="math inline">\(X_{h} := \widehat{R}_L(h) - R_L(h), h\in \mathcal{H}\)</span>,我们有下列定理<sup></sup>成立:</p>
<p><strong>定理 1</strong> 对任意<span class="math inline">\(\delta &gt; 0\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[\sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert \leqslant 4 n C_{\phi} \mathfrak{R}_S(\mathcal{\Pi_1(H)}) + M_{\phi} \sqrt{\frac{8\log \frac{2}{\delta}}{m}}
\]</div><p></p><p>其中<span class="math inline">\(n\)</span>为标签类别数,<span class="math inline">\(C_{\phi}\)</span>为函数<span class="math inline">\(\phi(\cdot)\)</span>的Lipschitz常数(假设<span class="math inline">\(\phi(\cdot)\)</span>满足Lipschitz连续),<span class="math inline">\(\mathfrak{R}(\mathcal{H}\circ \mathcal{S})\)</span>为定义在样本集<span class="math inline">\(S\)</span>上的假设类<span class="math inline">\(\Pi_1(\mathcal{H}) = \left\{x \mapsto h(x)_y \mid y\in \mathcal{Y}, h\in \mathcal{H}\right\}\)</span>的Rademacher复杂度,<span class="math inline">\(M_{\phi} = \sup_{x\in \mathcal{X}, h\in \mathcal{H}}\left|\phi(h(x))\right|\)</span>。下面先介绍Rademacher复杂度的定义。</p>
<p><strong>定义 2</strong> (Rademacher复杂度) 设样本集<span class="math inline">\(S = (z_1, z_2, \cdots, z_m)\)</span>中的每个样本都独立同分布地采自分布<span class="math inline">\(\mathcal{D}\)</span>,<span class="math inline">\(\mathcal{F}=\{f: \mathcal{Z}\rightarrow \mathbb{R}\}\)</span>为可测函数类。则<span class="math inline">\(\mathcal{F}\)</span>关于样本集<span class="math inline">\(S\)</span>的<strong>Rademacher复杂度</strong><sup></sup>定义为</p>
<p></p><div class="math display">\[\mathfrak{R}_S(\mathcal{H}) = \mathbb{E}_{S}\mathbb{E}_{\sigma\sim \{\pm 1\}^m}\left[\sup_{f\in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]
\]</div><p></p><p>其中<span class="math inline">\(\sigma = (\sigma_1, \cdots, \sigma_m)\in \{\pm 1\}^m\)</span>为随机向量,其中每个元素都独立同分布地采自<span class="math inline">\(\{-1, +1\}\)</span>。</p>
<p>为了证明定理 1,我们会先展示一个引理,该引理关联了假设类<span class="math inline">\(\mathcal{H}\)</span>和代理损失函数类<span class="math inline">\(\mathcal{L}\)</span>的Rademacher复杂度。现定义代理损失函数类<span class="math inline">\(\mathcal{L}\)</span>为</p>
<p></p><div class="math display">\[\mathcal{L} = \left\{(x, y)\rightarrow L(h, x, y)\mid h\in \mathcal{H}\right\}
\]</div><p></p><p>则有下列引理<sup></sup>成立:</p>
<p><strong>引理 1</strong> 设<span class="math inline">\(\mathfrak{R}_S(\mathcal{L})\)</span>为<span class="math inline">\(\mathcal{L}\)</span>关于样本集<span class="math inline">\(S\)</span>的Rademacher复杂度,<span class="math inline">\(\mathfrak{R}_S(\mathcal{H})\)</span>为<span class="math inline">\(\mathcal{H}\)</span>关于样本集<span class="math inline">\(S\)</span>的Rademacher复杂度,则我们有<span class="math inline">\(\mathfrak{R}_S(\mathcal{L})\leqslant 2 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H}))\)</span>。</p>
<p><strong>引理 1的证明</strong> 按照定义,可以将<span class="math inline">\(\mathfrak{R}_S(\mathcal{L})\)</span>界定如下:</p>
<p></p><div class="math display">\[\begin{aligned}
    \mathfrak{R}_S(\mathcal{L}) &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{L\in \mathcal{L}}\frac{1}{m}\sum_{i=1}^m \sigma_i L(x_i, y_i, h)\right] \\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \left(\phi\left(h(x_i)_{y_i}\right) + \sum_{y^{\prime}\neq y_i}\phi\left(-h(x_i)_{y^{\prime}}\right)\right)\right]\\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\left(\frac{1}{m}\sum_{i=1}^m \sigma_i \phi\left(h(x_i)_{y_i}\right) + \frac{1}{m}\sum_{i=1}^m \sigma_i \sum_{y^{\prime}\neq y_i}\phi\left(-h(x_i)_{y^{\prime}}\right)\right)\right]\\
    &amp;\leqslant \underbrace{\mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \phi\left(h(x_i)_{y_i}\right)\right]}_{(A)} + \underbrace{\mathbb{E}_S\mathbb{E}_{\sigma} \left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \sum_{y^{\prime}\neq y_i}\phi\left(-h(x_i)_{y^{\prime}}\right)\right]}_{(B)}
\end{aligned}
\]</div><p></p><p>其中最后一行不等式我们使用了<span class="math inline">\(\sup(\cdot)\)</span>函数的次可加性(sub-additivity)<sup></sup>:<span class="math inline">\(\sup(U + V)\leqslant \sup(U) + \sup(V)\)</span>。接下来我们尝试界定上述两项。设<span class="math inline">\(\alpha_i = 2\mathbb{I}_{y = y_i} - 1\)</span>,可以将第一项<span class="math inline">\((A)\)</span>界定如下:</p>
<p></p><div class="math display">\[\begin{aligned}
    (A) &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \phi\left(h(x_i)_{y_i}\right)\right]\\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right) (\alpha_i + 1)\right] \\
    &amp;\leqslant \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right)\alpha_i \right] + \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right) \right] \\
    &amp;\quad (\text{sub-additivity of} \sup(\cdot))\\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sum_y \phi\left(h(x_i)_{y}\right)\alpha_i\sigma_i \right] + \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right) \right] \\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sum_y \phi\left(h(x_i)_{y}\right)\sigma_i \right] + \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{2m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right) \right] \\
    &amp; (\alpha_i\sigma_i \text{ has the same distribution as } \sigma_i)\\
    &amp;= \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \sum_y \phi\left(h(x_i)_{y}\right) \right]\\
    &amp;\leqslant \sum_y \mathbb{E}_S\mathbb{E}_{\sigma}\left[\sup_{h\in \mathcal{H}}\frac{1}{m}\sum_{i=1}^m \sigma_i \phi\left(h(x_i)_{y}\right) \right] (\text{sub-additivity of} \sup(\cdot))\\
    &amp;= n \mathfrak{R}_S(\phi\circ \Pi_1(\mathcal{H}))
\end{aligned}
\]</div><p></p><p>其中<span class="math inline">\(n\)</span>是类别个数。与<span class="math inline">\((A)\)</span>项类似,<span class="math inline">\((B)\)</span>项也可以被$ \mathfrak{R}_S(\phi\circ \Pi_1(\mathcal{H}))$界定。</p>
<p>于是,我们可以将<span class="math inline">\(\mathcal{L}\)</span>关于样本集<span class="math inline">\(S\)</span>的Rademacher复杂度界定如下:</p>
<p></p><div class="math display">\[\mathfrak{R}_S(\mathcal{L}) \leqslant \underbrace{2 n \mathfrak{R}_S(\phi \circ \Pi_1(\mathcal{H})) \leqslant 2 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H}))}_{\text{Talagrand contraction principle}}
\]</div><p></p><p>最后一个不等式根据如下的<strong>Talagrand压缩原理</strong><sup></sup>得到:</p>
<p><strong>引理 2</strong> (Talagrand压缩原理) 设<span class="math inline">\(\phi: \mathbb{R}\rightarrow \mathbb{R}\)</span>为<span class="math inline">\(C_{\phi}\)</span>-Lipschitz函数,则</p>
<p></p><div class="math display">\[    \mathfrak{R}_S(\phi\circ\mathcal{H}) \leqslant C_{\phi}\mathfrak{R}_S(\mathcal{H})
\]</div><p></p><p>其中<span class="math inline">\(\phi\circ \mathcal{H} = \{z\mapsto \phi(h(z)): h\in \mathcal{H}\}\)</span>。</p>
<p>其次,我们介绍一个经验过程的单侧上界关于训练样本<span class="math inline">\(S\)</span>的期望的重要结论<sup></sup>,该结论将经验过程的单侧上界与Rademacher复杂度联系了起来。</p>
<p><strong>引理 3</strong></p>
<p></p><div class="math display">\[\mathbb{E}_{S}\left[\sup_{h\in \mathcal{H}} \left(\widehat{R}_L(h)- R_L(h)\right)\right] \leqslant 2\mathfrak{R}_S(\mathcal{L})
\]</div><p></p><p>这个引理的证明会用到一种称之为<strong>对称化(symmetrization)</strong> 的技术,大致为引入辅助样本集<span class="math inline">\(S^{\prime}\)</span>并以此将泛化误差<span class="math inline">\(R_L(h)\)</span>限制到<span class="math inline">\(S^{\prime}\)</span>上(类似测试集),并利用<span class="math inline">\(S\)</span>和<span class="math inline">\(S^{\prime}\)</span>样本集的对称性来在保持期望不变的前提下完成<span class="math inline">\(\sigma(\cdot)\)</span>函数的引入(这也是为何不等式左边有个期望<span class="math inline">\(\mathbb{E}_S[\cdot]\)</span>),详细证明可参考《Understanding machine learning: From theory to algorithms》Ch.26 <sup></sup>。</p>
<p>有了这些引理,我们可以接下来可以开始正式证明定理 1了。</p>
<blockquote>
<p><strong>注</strong> 下面再将定理 1再贴一遍:对任意<span class="math inline">\(\delta &gt; 0\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[\sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert \leqslant 4 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H})) + M_{\phi} \sqrt{\frac{8\log \frac{2}{\delta}}{m}}
\]</div><p></p></blockquote>
<p><strong>定理 1的证明</strong> 上述不等式是经验过程的双侧界,只需讨论与<span class="math inline">\(\sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h)- R_L(h)\right)\)</span>相关的单侧界至少以概率<span class="math inline">\(1 - \frac{\delta}{2}\)</span>成立即可。</p>
<p>记<span class="math inline">\(A(z_1, \cdots, z_m) = \sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h)- R_L(h)\right)\)</span>,通过引理 3,我们已经知道<span class="math inline">\(
\mathbb{E}_{S}\left \leqslant 2\mathfrak{R}_S(\mathcal{L})
\)</span>成立。现在我们需要考虑<span class="math inline">\(A(z_1, \cdots, z_m)\)</span>的界,那么如果我们可以证明有界变差条件<sup></sup><span class="math inline">\(\left|A(z_1, \cdots, z_m) - A(z_1, \cdots, z_i^{'}, \cdots, z_m)\right|\leqslant c\)</span>(也即将参数中的任意一项<span class="math inline">\(z_i = (x_i, y_i)\)</span>被与训练集独立同分布的<span class="math inline">\(z_i^{\prime} = (x^{\prime}_i, y_i^{\prime})\)</span>替换后,函数值的变化量被常数<span class="math inline">\(c\)</span>界定),那么就可以考虑应用<strong>McDiarmid不等式</strong>得到</p>
<p></p><div class="math display">\[    A(z_1, \cdots, z_m) \leqslant \mathbb{E}_{S}\left + c\sqrt{\frac{m}{2}\log (\frac{1}{\delta}))}
\]</div><p></p><p>以<span class="math inline">\(1 - \delta\)</span>概率成立。</p>
<p>那么我们先尝试界定单侧变差:</p>

<p></p><div class="math display">\[\begin{aligned}
    &amp;A(z_1, \cdots, z_m) - A(z_1, \cdots, z_i^{'}, \cdots, z_m)\\
    &amp;= \sup_{h\in \mathcal{H}}\left[\frac{1}{m}\sum_{j=1}^mL(h, x_j, y_j) - \mathbb{E}_{\mathcal{D}}\left\right]\\
    &amp;\quad \space - \sup_{h^{\prime}\in \mathcal{H}}\left[\frac{1}{m}L(h^{\prime}, x_i^{\prime}, y_i^{\prime}) + \frac{1}{m}\sum_{j\in \backslash\{i\}}L(h^{\prime}, x_j, y_j) - \mathbb{E}_{\mathcal{D}}\left\right]\\
    &amp;= \sup_{h\in \mathcal{H}}\inf_{h^{\prime}\in \mathcal{H}}\left[\frac{1}{m}\sum_{j=1}^mL(h, x_j, y_j) - \mathbb{E}_{\mathcal{D}}\left\right.\\
    &amp;\quad \space \left.- \frac{1}{m}L(h^{\prime}, x_i^{\prime}, y_i^{\prime}) - \frac{1}{m}\sum_{j\in \backslash\{i\}}L(h^{\prime}, x_j, y_j)
    + \mathbb{E}_{\mathcal{D}}\left\right]\\
    &amp;\leqslant \sup_{h\in \mathcal{H}}\left[\frac{1}{m}\sum_{j=1}^mL(h, x_j, y_j) - \mathbb{E}_{\mathcal{D}}\left\right.\\
    &amp;\quad \space \left.- \frac{1}{m}L(h, x_i^{\prime}, y_i^{\prime}) - \frac{1}{m}\sum_{j\in \backslash\{i\}}L(h, x_j, y_j)
    + \mathbb{E}_{\mathcal{D}}\left\right]\\
    &amp;= \sup_{h\in \mathcal{H}}\left[\frac{1}{m}L(h, x_i, y_i) - \frac{1}{m}L(h, x^{\prime}_i, y^{\prime}_i)\right]\\
    &amp;= \frac{1}{m}\sup_{h\in \mathcal{H}}\left[\phi\left(h(x_i)_{y_i}\right) + \sum_{y^{\prime}\neq y_i}\phi\left(-h(x_i)_{y^{\prime}}\right) - \phi\left(h(x_i^{\prime})_{y^{\prime}_i}\right) - \sum_{y^{\prime\prime}\neq y_i^{\prime}}\phi\left(-h(x_i^{\prime})_{y^{\prime\prime}}\right)\right]\\
    &amp;= \frac{1}{m}\sup_{h\in \mathcal{H}}\left[\phi\left(h(x_i)_{y_i}\right) + \phi\left(-h(x_i)_{y_i^{\prime}}\right) - \phi\left(h(x_i^{\prime})_{y_i^{\prime}}\right) - \phi\left(-h(x_i^{\prime})_{y_i}\right)\right]\\
    &amp;\leqslant \frac{4M_{\phi}}{m}
\end{aligned}
\]</div><p></p><p>同理可得<span class="math inline">\(A(z_1, \cdots, z_i^{'}, \cdots, z_m) - A(z_1, \cdots, z_m)\geqslant -\frac{4M_{\phi}}{m}\)</span>。于是<span class="math inline">\(\left|A(z_1, \cdots, z_m) - A(z_1, \cdots, z_i^{'}, \cdots, z_m)\right| \leqslant c = \frac{4M_{\phi}}{m}\)</span>。接着,应用McDiarmid不等式并结合引理 3、引理 1可知</p>
<p></p><div class="math display">\[\begin{aligned}
    \sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h)- R_L(h)\right) &amp;\leqslant \mathbb{E}_{S}\left[\sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h) - R_L(h)\right)\right] + c\sqrt{\frac{m}{2}\log (\frac{1}{\delta}))}\\
    &amp; \quad (\text{McDiarmid inequality})\\
    &amp;\leqslant \mathbb{E}_{S}\left[\sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h) - R_L(h)\right)\right] + M_{\phi}\sqrt{\frac{8\log (\frac{1}{\delta})}{m}} \quad (c = \frac{4M_{\phi}}{m})\\
    &amp;\leqslant 2\mathfrak{R}_S(\mathcal{L}) + M_{\phi}\sqrt{\frac{8\log (\frac{1}{\delta})}{m}} \quad (\text{Lemma } 3)\\
    &amp;\leqslant 4 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H}))+ M_{\phi}\sqrt{\frac{8\log (\frac{1}{\delta})}{m}} \quad (\text{Lemma } 1)
\end{aligned}
\]</div><p></p><p>对任意<span class="math inline">\(\delta &gt; 0\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率成立。用<span class="math inline">\(\frac{\delta}{2}\)</span>直接变量替换掉<span class="math inline">\(\delta\)</span>有:<span class="math inline">\(\sup_{h\in \mathcal{H}}\left(\widehat{R}_L(h)- R_L(h)\right)\leqslant 4 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H}))+ M_{\phi}\sqrt{\frac{8\log (\frac{2}{\delta})}{m}}\)</span>至少以<span class="math inline">\(1 - \frac{\delta}{2}\)</span>的概率成立。于是对于双侧界,我们有:</p>
<p></p><div class="math display">\[\sup_{h\in \mathcal{H}}\lvert \widehat{R}_L(h)- R_L(h)\lvert \leqslant 4 n C_{\phi} \mathfrak{R}_S(\Pi(\mathcal{H})) + M_{\phi} \sqrt{\frac{8\log \frac{2}{\delta}}{m}}
\]</div><p></p><p>对任意<span class="math inline">\(\delta &gt; 0\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率成立。定理得证。</p>
<p>于是,我们在第2部分提到的泛化差距<span class="math inline">\(R_L(\hat{h}) - \widehat{R}_L(\hat{h})\)</span>和“小项”<span class="math inline">\(\widehat{R}(h^{\circ}) - R(h^{\circ})\)</span>都可以被界定了。而对任意<span class="math inline">\(h\in \mathcal{H}\)</span>,有泛化界</p>
<p></p><div class="math display">\[R_L(h)\leqslant \widehat{R}_L(h) + 4 n C_{\phi} \mathfrak{R}_S(\Pi_1(\mathcal{H}))+ M_{\phi}\sqrt{\frac{8\log (\frac{1}{\delta})}{m}}
\]</div><p></p><p>对任意<span class="math inline">\(\delta &gt; 0\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率成立。至此,代理损失函数的泛化界得证。</p>
<h1 id="参考">参考</h1>
<ul>
<li> Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.</li>
<li> Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection. Advances in Neural Information Processing Systems, 2019, 32.</li>
<li> 周志华, 王魏, 高尉, 张利军. 机器学习理论导引. 机械工业出版社, 2020.</li>
<li> Han Bao: Learning Theory Bridges Loss Functions</li>
<li> 滕佳烨、王伯元、张臻: 机器学习理论简明手册</li>
<li> Bartlett P L, Jordan M I, McAuliffe J D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006, 101(473): 138-156.</li>
<li> Cortes C, DeSalvo G, Mohri M. Boosting with abstention. Advances in neural information processing systems, 2016, 29.</li>
<li> JOJO想:深度学习中的泛化理论 (9)良性过拟</li>
<li> Shalev-Shwartz S, Ben-David S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.</li>
<li> Wellner J A: Empirical processes: Theory and applications</li>
<li> Vershynin R. High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018.</li>
<li> Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine learning. MIT press, 2018.</li>
<li> Stanford CS229T/STATS231: Statistical Learning Theory - Lecture #6</li>
</ul>


</div>
<div id="MySignature" role="contentinfo">
    数学是符号的艺术,音乐是上界的语言。<br><br>
来源:https://www.cnblogs.com/orion-orion/p/19021877
頁: [1]
查看完整版本: 学习理论:代理损失函数的泛化界与Rademacher复杂度