学习理论:凸代理、代理与估计误差界
<p>这学期参加了同研究科的田中研的读书会,所选的是近年出的较新的书《Learning Theory from First Principles》<sup></sup>:</p><p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_251121114423_Learning%20Theory%20from%20First%20Principles.jpeg" width="330" height="420" alt="" align="center">
</p>
<p>作者Francis Bach是COLT2025的keynote speaker<sup></sup>。我承担了4.1-4.4部分(这周做了分享),该部分和我目前的科研方向比较相关。下面是我结合自己的科研的一些心得和笔记。</p>
<h1 id="1-导引">1 导引</h1>
<p>给定<span class="math inline">\(\mathcal{X}\times\mathcal{Y}\)</span>上的联合概率分布<span class="math inline">\(p(x, y)\)</span>,以及损失函数<span class="math inline">\(L:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb{R}\)</span>,定义预测函数<span class="math inline">\(h:\mathcal{X}\rightarrow \mathcal{Y}\)</span>的<strong>期望风险(expected risk)</strong>(也称泛化误差(generalization error)或测试误差(testing error)):</p>
<p></p><div class="math display">\ = \int_{\mathcal{X}\times \mathcal{Y}}L(y, h(x))\mathrm{d} p
\]</div><p></p><p>对于贝叶斯线性回归这类输出的是一个预测分布<span class="math inline">\(\hat{p}(t\mid x)\)</span>的任务,其预测函数<span class="math inline">\(h(x)\)</span>可以由条件期望<span class="math inline">\(h(x) = \mathbb{E}_t = \int t\mathrm{d}\hat{p}\)</span>表示,于是其期望风险可以写为</p>
<p></p><div class="math display">\[R(h)= \int\int_{\mathcal{X}\times \mathcal{Y}}L(y, t)\mathrm{d} p \mathrm{d}\hat{p}
\]</div><p></p><blockquote>
<p><strong>注</strong> 当从数据中进行学习时,<span class="math inline">\(h\)</span>依赖于随机训练数据而非测试数据,因此尽管<span class="math inline">\(R(h)\)</span>的表达式中关于测试数据取了期望,但在通常情况下它仍然是随机的。然而,<span class="math inline">\(R(h)\)</span>做为关于<span class="math inline">\(h\)</span>的泛函是确定性的。</p>
</blockquote>
<p>接下来考虑学习问题,给定从概率分布<span class="math inline">\(p(x, y)\)</span>中i.i.d.采样的大小为<span class="math inline">\(n\)</span>的随机样本,目标为学习一个预测函数<span class="math inline">\(h:\mathcal{X}\rightarrow \mathcal{Y}\)</span>使得期望风险<span class="math inline">\(R(h)\)</span>最小化,或等价地,使得下列<strong>超额风险(excess risk)</strong>(或称超额误差(excess error))最小:</p>
<p></p><div class="math display">\[R(h) - R^* = R(h) - \inf_{h \space \text{measurable}}R(h)
\]</div><p></p><p>其中<span class="math inline">\(\inf_{h \space \text{measurable}}\)</span>中的<span class="math inline">\(h\)</span>为可测函数。在这篇博客中我们关注基于经验风险最小化方法的统计分析。</p>
<p>然而,以二分类问题为例,预测函数<span class="math inline">\(h\)</span>是个布尔函数,在其假设空间中的经验风险最小化是难以处理的(intractable)组合优化问题。此外,分析这种布尔函数假设空间的复杂度可能需要进一步引入诸如VC维的工具。然而,可以通过凸代理损失的框架来学习一个实值函数<span class="math inline">\(f\)</span>以将该问题凸化从而克服之。此时,我们就可以使用拉德马赫复杂度来对经验风险最小化的渐进收敛进行分析。</p>
<p>考虑二分类场景,做为直接学习<span class="math inline">\(h: \mathcal{X}\rightarrow \left\{-1, +1\right\}\)</span>的替代,我们采用学习实值函数<span class="math inline">\(f: \mathcal{X}\rightarrow\mathbb{R}\)</span>(其对应的函数族为<span class="math inline">\(\mathcal{F}\)</span>)的方法,并定义<span class="math inline">\(h(x) = \mathrm{sign}(f(x))\)</span>,其中</p>
<p></p><div class="math display">\[h(x) = \left\{\begin{aligned}
+1 \quad f(x) \geqslant 0\\
-1 \quad f(x) < 0
\end{aligned} \right.
\]</div><p></p><p>我们将关于函数<span class="math inline">\(h = \text{sign}\circ f\)</span>的的0-1风险直接记为<span class="math inline">\(R_{\text{0-1}}(f)\)</span>,它可以表示为:</p>
<p></p><div class="math display">\
\]</div><p></p><p>注意,这里出于简洁,未严格讨论<span class="math inline">\(f(x)=0\)</span>的情形。记<span class="math inline">\(\mathbb{I}\left(yf(x)\leqslant 0\right):= \Phi_{\text{0-1}}\left(yf(x)\right)\)</span>,其中<span class="math inline">\(\Phi_{\text{0-1}}\left(\cdot\right)\)</span>称为“基于间隔的”0-1损失函数(注意与满足定义<span class="math inline">\(L_{\text{0-1}}(h(x), y) = \mathbb{I}\left(h(x)\neq y\right)\)</span>的0-1损失函数<span class="math inline">\(L_{\text{0-1}}(\cdot, \cdot)\)</span>相区分)。</p>
<h1 id="2-凸代理">2 凸代理</h1>
<p>优化上述<span class="math inline">\(R(f)\)</span>的经验版本<span class="math inline">\(\frac{1}{m}\sum_{i=1}^m\mathbb{I}\left(y_if(x_i)\leqslant 0\right)\)</span>是NP-hard的(由于<span class="math inline">\(\mathbb{I}\left(yf(x)\leqslant 0\right)\)</span>非凸不连续),因此我们可以考虑使用<strong>凸代理(convex surrogates)</strong> 技术对目标函数<span class="math inline">\(\Phi_{\text{0-1}} = \mathbb{I}\left(yf(x)\leqslant 0\right)\)</span>进行凸放松,其中我们使用具有更好的数值属性的函数<span class="math inline">\(\Phi\)</span>(主要是凸性)对<span class="math inline">\(\Phi_{\text{0-1}}\)</span>进行替代。下面是一些凸代理损失函数的例子<sup></sup>:</p>
<p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_71faa869.png" width="580" height="420" alt="" align="center">
</p>
<blockquote>
<p><strong>注</strong> 深度学习中的sigmoid函数<span class="math inline">\(\Phi(t) = 1 / (1 + e^{-t})\)</span>也可以看做一种针对<span class="math inline">\(\frac{1}{2}\mathbb{I}(t\geqslant 0)\)</span>的光滑连续的非凸替代函数<sup></sup>。</p>
</blockquote>
<p>于是,做为对直接优化0-1风险<span class="math inline">\(R_{\text{0-1}}(f)\)</span>(或其经验版本)的替代,我们转而优化下列的代理风险<span class="math inline">\(R_{\Phi}(f)\)</span>(及其经验版本):</p>
<p></p><div class="math display">\
\]</div><p></p><h1 id="3-校准与条件风险">3 校准与条件风险</h1>
<p>理想的代理损失函数<span class="math inline">\(\Phi\)</span>应该满足<strong>代理风险一致性(Surrogate risk consistency)</strong><sup></sup>:在总体层面上对<span class="math inline">\(\Phi\)</span>-风险的优化能够确保优化对0-1风险的优化:</p>
<p><strong>代理风险一致性 (Surrogate risk consistency)</strong> 若对优化代理经验风险得到的任意函数序列<span class="math inline">\(\hat{f}_1, \hat{f}_2, \cdots, \hat{f}_m, \cdots\)</span>,以及任意分布<span class="math inline">\(p(x, y)\)</span>满足:</p>
<p></p><div class="math display">\[R_{\Phi}(f_m)\xrightarrow{m\rightarrow \infty} R_{\Phi}^* \Longrightarrow R_{\text{0-1}}(f_m)\xrightarrow{m\rightarrow \infty} R^*_{\text{0-1}}
\]</div><p></p><p>则称损失<span class="math inline">\(\Phi\)</span>对原目标损失<span class="math inline">\(\mathcal{\Phi}_{\text{0-1}}\)</span>是<strong>代理风险一致</strong>的,或<strong>Bayes一致</strong>的/<strong>Fisher一致</strong>的,有时也称代理损失函数<span class="math inline">\(\Phi\)</span>是<strong>校准 (calibrated)</strong> 的。直观地理解可以参见下图<sup></sup>:</p>
<p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_250222011842_%E6%A0%A1%E5%87%86%E7%9A%84%E4%BB%A3%E7%90%86%E6%8D%9F%E5%A4%B1%E7%9A%84%E6%80%A7%E8%B4%A8.png" width="650" height="420" alt="" align="center">
</p>
<p>直观理解地,代理风险一致性也就意味着<strong>能够最小化<span class="math inline">\(R_{\Phi}(f)\)</span>的输出函数也能够最小化<span class="math inline">\(R_{\text{0-1}}(f)\)</span></strong>。那么我们接下来先来考虑最小化<span class="math inline">\(R_{\text{0-1}}(f)\)</span>的问题。</p>
<p></p><div class="math display">\
\]</div><p></p><p>为了进一步简化后续的分析,我们根据条件期望以及其相关的全期望定理,将<span class="math inline">\(R_{\text{0-1}}(f)\)</span>写为:</p>
<p></p><div class="math display">\[\begin{aligned}
R_{\text{0-1}}(f) &=\mathbb{E}_{x}\left[\mathbb{E}_{y}\left[\Phi_{\text{0-1}}\left(yf(x)\right)\mid x\right]\right] \\
&= \mathbb{E}_{x}\underbrace{\left[\eta(x)\Phi_{\text{0-1}}\left(f(x)\right) + (1 - \eta(x))\Phi_{\text{0-1}}\left(-f(x)\right)\right]}_{\text{conditional risk }C_{\text{0-1}}}
\end{aligned}
\]</div><p></p><p>其中<span class="math inline">\(\eta(x) = p(y=+1\mid x)\)</span>,我们有<span class="math inline">\(\mathbb{E} = 2\eta(x) - 1\)</span>。我们称其中内层的条件期望项为0-1间隔损失<span class="math inline">\(\Phi_{\text{0-1}}\)</span>的<strong>条件风险(conditional risk)</strong>(也称为pointwise风险<sup></sup>),由于在其计算过程中<span class="math inline">\(y\)</span>取期望取掉了,因此该项只和<span class="math inline">\(f(x)\)</span>相关,因此我们将其记为<span class="math inline">\(C_{\text{0-1}}(f)\)</span>:</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\text{0-1}}(f) &= \mathbb{E}_{y}\left[\Phi_{\text{0-1}}\left(yf(x)\right)\mid x\right] \\
& = \eta(x)\Phi_{\text{0-1}}\left(f(x)\right) + (1 - \eta(x))\Phi_{\text{0-1}}\left(-f(x)\right)
\end{aligned}
\]</div><p></p><p>我们用<span class="math inline">\(C^*_{\text{0-1}} = \inf_{f} C_{\text{0-1}}(f)\)</span>来表示最优的<span class="math inline">\(\Phi_{\text{0-1}}\)</span>的条件风险,可得</p>
<p></p><div class="math display">\[ C^*_{\text{0-1}} = \min\left\{\eta(x), 1 - \eta(x)\right\}
\]</div><p></p><p>而最优<span class="math inline">\(0-1\)</span>风险(也称<strong>贝叶斯风险(Bayes risk)</strong>)可以表示为:</p>
<p></p><div class="math display">\[ R_{\text{0-1}}^* = \mathbb{E}_{x}\left[\min\left\{\eta(x), 1 - \eta(x)\right\}\right]
\]</div><p></p><p>上式也可进一步写为$ R_{\text{0-1}}^* = \mathbb{E}_{x}\left[\frac{1}{2} - \frac{1}{2}\left|\mathbb{E}\right|\right]$。</p>
<p>此时最优实值函数(也称<strong>贝叶斯实值函数(Bayes real-valued function)</strong>)<span class="math inline">\(f^*_{\text{0-1}}\)</span>需要满足当<span class="math inline">\(\eta(x) < \frac{1}{2}\)</span>时,<span class="math inline">\(f^*_{\text{0-1}}(x) < 0\)</span>,当<span class="math inline">\(\eta(x) > \frac{1}{2}\)</span>时,<span class="math inline">\(f^*_{\text{0-1}}(x) > 0\)</span>,当<span class="math inline">\(\eta(x)=\frac{1}{2}\)</span>时,<span class="math inline">\(f^*_{\text{0-1}}(x)\)</span>可以是任意实数。于是,<span class="math inline">\(f^*_{\text{0-1}}\)</span>满足:</p>
<p></p><div class="math display">\[f^*_{\text{0-1}}\in \mathcal{F}^* = \left\{f: \begin{aligned}
&\text{当} \eta(x) = \frac{1}{2} \text{时} f(x) \text{可以是任意的实数;}\\
&\text{当} \eta(x) \neq \frac{1}{2} \text{时} \underbrace{f(x)(\eta(x) - \frac{1}{2}) > 0}_{f(x) \text{与} 2\eta(x) - 1 (=\mathbb{E})\text{同号}}
\end{aligned}\right\}
\]</div><p></p><p>我们称满足<span class="math inline">\(h^* = \mathrm{sign}(2\eta - 1)\)</span>(<span class="math inline">\(=\mathrm{sign}\left(\mathbb{E}\right)\)</span>)的分类器为贝叶斯最优分类器,或直接简称为<strong>贝叶斯分类器(Bayes's classifier)</strong>。这里可以与在线性回归问题中得到的最优预测器<span class="math inline">\(h^* = \mathbb{E}\)</span>进行对照。</p>
<p>我们用<span class="math inline">\(C_{\Phi}\)</span>来表示代理损失<span class="math inline">\(\Phi\)</span>的条件风险,并用<span class="math inline">\(C^*_{\Phi}\)</span>来表示其对应的最优条件风险,用<span class="math inline">\(R^*_{\Phi}\)</span>来表示其对应的最优期望风险。</p>
<p>由上述内容可知,对于代理损失<span class="math inline">\(\Phi\)</span>而言,如果也能确保<strong>其对应的最优实值输出函数<span class="math inline">\(f_{\Phi}^*\)</span>在<span class="math inline">\(\eta(x) \neq \frac{1}{2}\)</span>时与<span class="math inline">\(2\eta(x) - 1\)</span>同号</strong>,那么它就是校准的。</p>
<p>可以证明,当<span class="math inline">\(\Phi\)</span>为凸函数时,存在一个可用的校准充要条件,如下述命题所示:</p>
<p><strong>命题</strong> <sup></sup> 设<span class="math inline">\(\Phi: \mathbb{R}\rightarrow \mathbb{R}\)</span> 为凸函数。代理函数<span class="math inline">\(\Phi\)</span>是校准的当且仅当<span class="math inline">\(\Phi\)</span>在<span class="math inline">\(0\)</span>处可微且<span class="math inline">\(\Phi^{\prime}(0) < 0\)</span>。</p>
<p><strong>证明</strong> 由于<span class="math inline">\(\Phi\)</span>是凸的,则<span class="math inline">\(C_{\Phi}\)</span>也是凸的,因此可以考虑<span class="math inline">\(\eta(x)\neq \frac{1}{2}\)</span>时, <span class="math inline">\(C_{\Phi}(f)\)</span>在0处的左和右导数以对关于最小值的位置进行分类讨论,有如下所示的<span class="math inline">\((a)\)</span>和<span class="math inline">\((b)\)</span>两种可能(最优点$ f^<em>(x)> 0<span class="math inline">\(当且仅当在0点的右导数严格为负, 最优点\)</span> f^</em>(x) < 0$当且仅当在0点的左导数严格为正):</p>
<p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_931dc5bb.png" width="650" height="200" alt="" align="center">
</p>
<p>(a) <span class="math inline">\(f^*_{\Phi}(x) > 0 \Leftrightarrow C_{\Phi}^{\prime}(0_{+}) = \eta(x)\Phi^{\prime}\left(0_{+}\right) - (1 - \eta(x))\Phi^{\prime}\left(0_{-}\right) < 0\)</span><br>
(b) <span class="math inline">\(f^*_{\Phi}(x) < 0 \Leftrightarrow C_{\Phi}^{\prime}(0_{-}) = \eta(x)\Phi^{\prime}\left(0_{-}\right) - (1 - \eta(x))\Phi^{\prime}\left(0_{+}\right) > 0\)</span></p>
<p>先证明充分性。假设<span class="math inline">\(\Phi\)</span>是校正的。如令上述等式<span class="math inline">\((a)\)</span>中的<span class="math inline">\(\eta\rightarrow {\frac{1}{2}}_{+}\)</span>,这将导致<span class="math inline">\(C_{\Phi}^{\prime}(0_{+}) = \frac{1}{2} \left[\Phi^{\prime}\left(0_{+}\right) - \Phi^{\prime}\left(0_{-}\right)\right]\leqslant 0\)</span>。由于<span class="math inline">\(\Phi\)</span>是凸的,等式<span class="math inline">\(\Phi^{\prime}\left(0_{+}\right) - \Phi^{\prime}\left(0_{-}\right)\geqslant 0\)</span>恒成立。因此,<span class="math inline">\(\Phi^{\prime}\left(0_{+}\right) = \Phi^{\prime}\left(0_{-}\right)\)</span>,这意味着<span class="math inline">\(\Phi\)</span>在<span class="math inline">\(0\)</span>处是可微的。由<span class="math inline">\((a), (b)\)</span>与之前提到的同号条件可得,<span class="math inline">\(\Phi^{\prime}(0) < 0\)</span>。</p>
<p>再证明必要性。假设<span class="math inline">\(\Phi\)</span>在0点是可微的且<span class="math inline">\(\Phi^{\prime}(0) < 0\)</span>,则<span class="math inline">\(C_{\Phi}^{\prime}(0) = (2\eta - 1)\Phi^{\prime}(0)\)</span>,则由<span class="math inline">\((a), (b)\)</span>可得之前提到的同号条件。于是原命题得证。</p>
<p>注意上述命题排除了凸代理<span class="math inline">\(\Phi(f) = (1-f)+ = \max\left\{1-f, 0\right\}\)</span>,该函数在0点处不可微。此外,之前我们在第2部分提到的所有代理损失函数的例子都是校准的。</p>
<blockquote>
<p><strong>注</strong> 在使用概率模型进行分类的情况下,若学习<span class="math inline">\(p(y=1\mid x)\)</span>的模型,则校准也可能指关于这个概率的估计的准确度。</p>
</blockquote>
<h1 id="4-代理损失函数的超额风险界">4 代理损失函数的超额风险界</h1>
<p>由前文叙述可知,任意<span class="math inline">\(x\in \mathcal{X}\)</span>,若代理损失函数<span class="math inline">\(\Phi\)</span>是校准的,则最小化条件代理风险<span class="math inline">\(C_{\Phi}(f)\)</span>可以得到最优预测函数<span class="math inline">\(h^* = \text{sign}(f^*)\)</span>。这个结果是当样本数量<span class="math inline">\(m\rightarrow \infty\)</span>时的渐近(asymptotic)结果。</p>
<p>接下来,我们要更进一步,得到一个非渐进(non-asymptotic)的结果:如果能够控制代理超额风险(可以通过经验风险最小化达到),就能够控制目标超额风险。形式化地说,我们需要证明如下所示的<strong>代理损失函数的超额风险界</strong>:</p>
<p></p><div class="math display">\[ R_{\text{0-1}}(f) - R^*_{\text{0-1}}\leqslant \Gamma\left(R_{\Phi}(f) - R_{\Phi}^*\right)
\]</div><p></p><p>对<span class="math inline">\(\forall f(x)\in \mathbb{R}\)</span>成立,其中<span class="math inline">\(\Gamma: \mathbb{R}_{+}\rightarrow \mathbb{R}_{+}\)</span>为满足属性<span class="math inline">\(\lim_{t\rightarrow 0^{+}}\Gamma (t) = 0\)</span>的非递减函数。</p>
<p>函数<span class="math inline">\(\Gamma(t)\)</span>有时被称为<strong>校准函数(calibration function)</strong> 或 <strong>超额风险率(excess risk rate)</strong>。这里我们关注当<span class="math inline">\(t\rightarrow 0^{+}\)</span>时,函数<span class="math inline">\(\Gamma(t)\)</span>的表现。后面我们会证明hinge损失的校准函数是<span class="math inline">\(\mathcal{O}(t)\)</span>的,而诸如平方和logistic损失的光滑凸代理则可能会导致<span class="math inline">\(\mathcal{O}(\sqrt{t})\)</span>的校准函数。</p>
<blockquote>
<p><strong>注</strong> 由于这里我们关注的是<span class="math inline">\(t\rightarrow 0^{+}\)</span>时的情况,所以<span class="math inline">\(\mathcal{O}(t)\)</span>是比<span class="math inline">\(\mathcal{O}(\sqrt{t})\)</span>更紧的结果。因此,尽管hinge损失在光滑的性质上没平方和logistic损失那么好,但它所导出的校准函数结果更好,这也是一种“权衡”的体现。</p>
</blockquote>
<p>现在我们考虑将期望风险转换为我们之前提到的条件风险以简化分析。由条件风险的定义可得,对<span class="math inline">\(\forall f(x)\in \mathbb{R}\)</span>,有</p>
<p></p><div class="math display">\[\begin{aligned}
\mathbb{E}_x\left\leqslant \Gamma\left(\mathbb{E}_x\left\right)
\end{aligned}
\]</div><p></p><p>若函数<span class="math inline">\(\Gamma\)</span>是凹函数,则可利用Jensen不等式得到</p>
<p></p><div class="math display">\[\mathbb{E}_x\left\leqslant \mathbb{E}_x\left[\Gamma\left(C_{\Phi}(f) - C_{\Phi}^*\right)\right]
\]</div><p></p><p>于是,若能证明<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^*\leqslant \Gamma(C_{\Phi}(f) - C_{\Phi}^*)\)</span>对<span class="math inline">\(\forall f(x)\in \mathbb{R}\)</span>成立,则代理损失函数的超额风险界得证。</p>
<p><strong>超额条件0-1风险的表达式</strong></p>
<p>接下来我们考虑对<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^*\)</span>进行表示。由前面提到的知识可知:</p>
<p></p><div class="math display">\[C_{\text{0-1}}(f) - C^*_{\text{0-1}} = \eta\Phi_{\text{0-1}}\left(f\right) + (1 - \eta)\Phi_{\text{0-1}}\left(-f\right) - \min\left\{\eta, 1 - \eta\right\}
\]</div><p></p><p>接下来我们需要对<span class="math inline">\(\eta(x)\)</span>和<span class="math inline">\(f(x)\)</span>的取值情况进行分类讨论:</p>
<ol>
<li>
<p><span class="math inline">\(\eta(x) = \frac{1}{2}\)</span>:</p>
<p><span class="math inline">\(C_{\text{0-1}}(f) = \eta\Phi_{\text{0-1}}\left(f\right) + (1 - \eta)\Phi_{\text{0-1}}\left(-f\right) = \frac{1}{2}\)</span>为常值函数,故<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^* = 0\)</span>。</p>
</li>
<li>
<p><span class="math inline">\(\eta(x) > \frac{1}{2}\)</span>:</p>
<p>此时<span class="math inline">\(C^*_{\text{0-1}} = 1 - \eta\)</span>(在$ f(x) > 0$时取得)。</p>
<p>a. 当<span class="math inline">\(f(x) \leqslant 0\)</span>时,<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^* = 2\eta - 1\)</span>;<br>
b. 当<span class="math inline">\(f(x) > 0\)</span>时,<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^* = 0\)</span>。<br>
于是可以得到</p>
</li>
</ol>
<p></p><div class="math display">\[ \forall f(x)\in \mathbb{R}, C_{\text{0-1}}(f) - C_{\text{0-1}}^* = \left(2\eta - 1\right)\Phi_{\text{0-1}}\left(f\right)
\]</div><p></p><ol start="3">
<li>
<p><span class="math inline">\(\eta < \frac{1}{2}\)</span>:</p>
<p>同理可得<span class="math inline">\(C_{\text{0-1}}(f) - C_{\text{0-1}}^* = \left(1 - 2\eta(x)\right)\Phi_{\text{0-1}}(-f(x))\)</span>。</p>
</li>
</ol>
<p>综上所述,对任意<span class="math inline">\(\eta(x)\in \)</span>:</p>
<p></p><div class="math display">\[ \forall f(x)\in \mathbb{R}, C_{\text{0-1}}(f) - C_{\text{0-1}}^* = |2\eta - 1|\Phi_{\text{0-1}}\left((2\eta - 1)f\right)
\]</div><p></p><p>我们也可以对其进行放松,以得到一个更加实用的结果:</p>
<p></p><div class="math display">\[\begin{aligned}
\forall f(x)\in \mathbb{R}, C_{\text{0-1}}(f) - C_{\text{0-1}}^* &\leqslant |2\eta - 1 - b(f)|\Phi_{\text{0-1}}\left((2\eta - 1)f\right)\\
&\leqslant |2\eta - 1 - b\left(f\right)|
\end{aligned}
\]</div><p></p><p>其中<span class="math inline">\(b(f)\)</span>为不改变<span class="math inline">\(f(x)\)</span>符号的保号函数。</p>
<p><strong>平方代理损失</strong> 对平方代理损失<span class="math inline">\(\Phi(f) = (f - 1)^2\)</span>,我们有</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) = \eta(f - 1)^2 + (1 - \eta)(-f - 1)^2
\end{aligned}
\]</div><p></p><p><span class="math inline">\(C_{\Phi}^{\prime}(f) = 0\)</span>当且仅当<span class="math inline">\(f^{*}_{\Phi} = 2\eta - 1\)</span>时满足,此时有<span class="math inline">\(C_{\Phi}^* = 4\eta(1 - \eta)\)</span>,故</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) - C_{\Phi}^* &= f^2 - 2\left(2\eta - 1\right)f + (2\eta - 1)^2\\
&= \left(f - \left(2\eta - 1\right)\right)^2\\
&= (2\eta - 1 - f)^2 \\
&\geqslant \left(C_{\text{0-1}}(f) - C^*_{\text{0-1}}\right)^2 \quad (b(f) = f)
\end{aligned}
\]</div><p></p><p>于是就能得到<span class="math inline">\(\Gamma(t) = \mathcal{O}(\sqrt{t})\)</span>的平方代理损失的超额风险界;</p>
<p></p><div class="math display">\[R_{\text{0-1}}(f) - R_{\text{0-1}}^*\leqslant (R_{\Phi}(f) - R^*_{\Phi})^{\frac{1}{2}}
\]</div><p></p><p>我们接下来会扩展到更一般的光滑代理的校准结果。</p>
<p><strong>光滑代理损失</strong> 我们考虑形式为<span class="math inline">\(\Phi(f) = F(f) - f\)</span>的光滑代理损失(可加上任意乘性常数缩放及加性常数平移),其中对平方代理损失而言<span class="math inline">\(F(f) = \frac{1}{2}f^2\)</span>,可验证此时</p>
<p></p><div class="math display">\[ \Phi(f) = \frac{1}{2}f^2 - f = \frac{1}{2}(f - 1)^2 - \frac{1}{2}
\]</div><p></p><p>对logistic代理损失而言<span class="math inline">\(F(f) = 2\log (e^{f/2} + e^{-f/2})\)</span>,可验证此时</p>
<p></p><div class="math display">\[\begin{aligned}
\Phi(f) &= 2\log(e^{f/2} + e^{-f/2}) - f\\
&= 2\log(e^{f/2} + e^{-f/2}) - 2\log e^{f/2}\\
&= 2\log(\frac{e^{f/2} + e^{-f/2}}{e^{f/2}})\\
&= 2\log(1 + e^{-f})
\end{aligned}
\]</div><p></p><p>我们假设<span class="math inline">\(F\)</span>为偶函数且<span class="math inline">\(\beta-\)</span>光滑(<span class="math inline">\(\beta > 0\)</span>)这意味着对于所有的<span class="math inline">\(f \in \mathcal{F}\)</span>和<span class="math inline">\(\alpha \in \mathbb{R}\)</span>,有</p>
<p></p><div class="math display">\[F(f) - \alpha f(x) - \inf_{f\in \mathcal{F}}\left\{F(f) - \alpha f\right\}\geqslant \frac{1}{2\beta}|\alpha - F^{\prime}(f)|^2
\]</div><p></p><blockquote>
<p><strong>注</strong> 这里的一种证明方法是使用Fenchel共轭<sup></sup><span class="math inline">\(F^*: \mathcal{F}\rightarrow\mathbb{R}\)</span>,它是<span class="math inline">\((1/\beta)\)</span>-强凸的。我们有</p>
<p></p><div class="math display">\[\begin{aligned}
&F(f) - \alpha f(x) - \inf_{f\in \mathcal{F}}\left\{F(f) - \alpha f\right\}\\
&= \underbrace{F(f) }_{\sup_{\alpha\in \mathbb{R}}\left\{f\alpha - F^*(\alpha)\right\}}- \alpha f(x) + \underbrace{\sup_{f\in \mathcal{F}}\left\{\alpha f - F(f)\right\}}_{F^*(\alpha)}\\
&= F^*(\alpha) - f(x)\alpha + \sup_{\alpha\in \mathbb{R}}\left\{f(x)\alpha - F^*(\alpha)\right\}\\
&= F^*(\alpha) -f(x)\alpha - \inf_{\alpha\in \mathbb{R}}\left\{F^*(\alpha) - f(x)\alpha \right\}\\
&\geqslant \frac{1}{2\beta}\lVert \alpha - F^{\prime}(f)\rVert^2
\end{aligned}
\]</div><p></p></blockquote>
<blockquote>
<p>其中<span class="math inline">\(F^{\prime}(f) = \inf_{\alpha\in \mathbb{R}}\left\{F^*(\alpha) - f(x)\alpha\right\}\)</span>。</p>
</blockquote>
<p>由<span class="math inline">\(\Phi(f) = F(f) - f\)</span>且<span class="math inline">\(F(f)\)</span>为偶函数,代入可得<span class="math inline">\(C_{\Phi}(f) = \eta\Phi(f) + (1 - \eta)\Phi (-f) = F(f) - (2\eta - 1)f(x)\)</span>,因此</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) - C_{\Phi}^* &= F(f) - (2\eta - 1)f - \inf_f\left\{F(f) - (2\eta - 1)f\right\}\\
&\geqslant \frac{1}{2\beta}(2\eta - 1 - F^{\prime}(f))^2
\\
&\geqslant \frac{1}{2\beta}\left^2
\end{aligned}
\]</div><p></p><p>这将得出<span class="math inline">\(\Gamma(t) = \sqrt{2\beta t} = \mathcal{O}(\sqrt{t})\)</span>的代理损失函数的超额风险界:</p>
<p></p><div class="math display">\[R_{\text{0-1}}(f) - R^*_{\text{0-1}}\leqslant \sqrt{2\beta}(R_{\Phi}(f) - R^*_{\Phi})^{\frac{1}{2}}
\]</div><p></p><p><strong>hinge损失<sup></sup></strong></p>
<p>对hinge损失<span class="math inline">\(\Phi(f) = (1-f)+ = \max\left\{1-f, 0\right\}\)</span>,我们有</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) = \eta\max\left\{1-f, 0\right\} + (1 - \eta)\max\left\{1+f, 0\right\}
\end{aligned}
\]</div><p></p><p>接下来我们来表示<span class="math inline">\(C^*_{\phi}\)</span>,而这需要对<span class="math inline">\(\eta(x)\)</span>的取值情况进行分类讨论:</p>
<ol>
<li>
<p><span class="math inline">\(\eta(x) = \frac{1}{2}\)</span>: <span class="math inline">\(f^*_{\Phi}(x) = c \in [-1, 1]\)</span>,<span class="math inline">\(C^*_{\Phi} = \frac{1}{2}(1-f) + \frac{1}{2}(1+f) = 1\)</span>。</p>
</li>
<li>
<p><span class="math inline">\(\eta(x) > \frac{1}{2}\)</span>: <span class="math inline">\(f^*(x) = 1\)</span>,<span class="math inline">\(C^*_{\Phi} = 2(1 - \eta)\)</span>。</p>
</li>
<li>
<p><span class="math inline">\(\eta(x) < \frac{1}{2}\)</span>: <span class="math inline">\(f^*(x) = -1\)</span>,<span class="math inline">\(C^*_{\Phi} = 2\eta\)</span>。</p>
</li>
</ol>
<p>综上所述,<span class="math inline">\(C^*_{\Phi} = 2\min\left\{\eta, 1 - \eta\right\}\)</span>。</p>
<p>于是</p>
<p></p><div class="math display">\[C_{\Phi}(f) - C^*_{\Phi} = \eta\max\left\{1-f, 0\right\} + (1 - \eta)\max\left\{1+f, 0\right\} - 2\min\left\{\eta, 1 - \eta\right\}
\]</div><p></p><p>接下来我们需要对<span class="math inline">\(\eta(x)\)</span>和<span class="math inline">\(f(x)\)</span>的取值情况进行分类讨论:</p>
<ol>
<li>
<p><span class="math inline">\(\eta(x) = \frac{1}{2}\)</span>:<br>
此时<span class="math inline">\(C_{\text{0-1}}(f) - C^*_{\text{0-1}} = 0\)</span>,故必有<span class="math inline">\(C_{\Phi}(f) - C^*_{\Phi} \geqslant C_{\text{0-1}}(f) - C^*_{\text{0-1}}\)</span>。</p>
</li>
<li>
<p><span class="math inline">\(\eta(x) > \frac{1}{2}\)</span>:</p>
<p>此时<span class="math inline">\(C^*_{\Phi} = 2( 1- \eta)\)</span>,<span class="math inline">\(C_{\text{0-1}}(f) - C^*_{\text{0-1}} = \left(2\eta - 1\right)\Phi_{\text{0-1}}\left(f\right)\)</span>。</p>
<p>a. 当<span class="math inline">\(f(x) < -1\)</span>时,</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) - C^*_{\Phi} &= \eta(1 - f) - 2(1 - \eta) \\
&= 2\eta - 1- \underbrace{\left(1 - \eta(1 - f)\right)}_{<0}\\
&\geqslant 2\eta - 1\\
&= C_{\text{0-1}}(f) - C^*_{\text{0-1}}
\end{aligned}
\]</div><p></p><p>b. 当<span class="math inline">\(-1 \leqslant f(x) \leqslant 1\)</span>时,</p>
<p></p><div class="math display">\[\begin{aligned}
C_{\Phi}(f) - C^*_{\Phi} &= 1 + (1 - 2\eta) f - 2(1 - \eta) \\
&= 2\eta - 1- \underbrace{(2\eta - 1)f}_{b(f)}\\
&= |2\eta - 1 - b(f)|\\
&\geqslant C_{\text{0-1}}(f) - C^*_{\text{0-1}}
\end{aligned}
\]</div><p></p><p>c. 当<span class="math inline">\(f(x) > 1\)</span>时,此时<span class="math inline">\(C_{\text{0-1}}(f) - C^*_{\text{0-1}} = 0\)</span>。故必有<span class="math inline">\(C_{\Phi}(f) - C^*_{\Phi} \geqslant C_{\text{0-1}}(f) - C^*_{\text{0-1}}\)</span>。</p>
</li>
<li>
<p><span class="math inline">\(\eta < \frac{1}{2}\)</span>: 类似地,有<span class="math inline">\(C_{\Phi}(f) - C^*_{\Phi} \geqslant C_{\text{0-1}}(f) - C^*_{\text{0-1}}\)</span>。</p>
</li>
</ol>
<p>综上所述,代理损失函数的超额风险界对于<span class="math inline">\(\Gamma\)</span>为恒等函数时成立(也即<span class="math inline">\(\Gamma(t) = t = \mathcal{O}(t)\)</span>)。换句话说,对于hinge损失,我们有</p>
<p></p><div class="math display">\[R_{\text{0-1}}(f) - R^*_{\text{0-1}} \leqslant R_{\Phi}(f) - R^*_{\Phi}
\]</div><p></p><p>也就意味着超额<span class="math inline">\(\Phi\)</span>-风险直接控制超额0-1风险。</p>
<p>对于同样的二分类问题,可以使用不同的凸代理。尽管贝叶斯分类器总是相同的(即<span class="math inline">\(h^* = \mathrm{sign}(2\eta - 1)\)</span>),但大部分代理风险的最小点<span class="math inline">\(f^*\)</span>是不同的。例如,对于hinge损失,<span class="math inline">\(f^*\)</span>正好为<span class="math inline">\(\mathrm{sign}(2\eta - 1)\)</span>,然而对于诸如形式为<span class="math inline">\(\Phi(f) = F(f) - f\)</span>的损失,我们有<span class="math inline">\(F^{\prime}(f) = 2\eta - 1\)</span>,因此对于平方损失,<span class="math inline">\(f^* = 2\eta - 1\)</span>;对于logistic损失,可以验证<span class="math inline">\(f^* = 2 \text{atanh}(2\eta - 1)\)</span>(其中<span class="math inline">\(\text{atanh}\)</span>表示双曲反正切)。</p>
<blockquote>
<p><strong>注</strong> 当函数空间<span class="math inline">\(\mathcal{F}\)</span>足够灵活以致<span class="math inline">\(f\in\mathcal{F}\)</span>可以达到代理风险<span class="math inline">\(R_{\Phi}\)</span>的最小点时,例如核方法或带有足够多神经元的神经网络,最小的0-1风险<span class="math inline">\(R^*_{\text{0-1}}\)</span>是可达的。</p>
<p>然而在实践中,尤其在高维的情况下,使用受限的模型类,尤其是线性模型,达到最小代理风险将不再可能。在这种情况下,我们可以定义基于模型类的一致性<sup></sup>(关于这个可以参见我的博客《学习理论:预测器-拒绝器多分类弃权学习 》)。</p>
</blockquote>
<h1 id="5-超额风险的分解">5 超额风险的分解</h1>
<p>考虑最小化代理损失函数的超额风险<span class="math inline">\(R_{\Phi}(\hat{f}) - R_{\Phi}^{*}\)</span>,这里<span class="math inline">\(\hat{f}\)</span>做为<span class="math inline">\(f^*\)</span>的估计值,是最小化如下所示的经验误差(泛化误差的近似)<sup></sup>来获得的:</p>
<p></p><div class="math display">\[\hat{f} = \inf_{f\in \mathcal{F}}\widehat{R}_{\Phi}(f) = \inf_{f\in \mathcal{F}}\left\{ \frac{1}{m}\sum_{i=1}^m \Phi(y_if(x_i))\right\}
\]</div><p></p><p>而这里的超额风险可以进一步做如下分解<sup></sup>:</p>
<p></p><div class="math display">\[R_{\Phi}(\hat{f}) - R_{\Phi}^{*} = \underbrace{\left(R_{\Phi}(\hat{f}) - \inf_{f\in\mathcal{F}}R_{\Phi}(f)\right)}_{\text{estimation error}\geqslant 0} + \underbrace{\left(\inf_{f\in\mathcal{F}}R_{\Phi}(f) - R_{\Phi}^{*}\right)}_{\text{approximation error}\geqslant 0}
\]</div><p></p><p>其中<span class="math inline">\(\inf_{f\in\mathcal{F}}R_{\Phi}(f)\)</span>为假设类最优(best-in-class)误差,对应的假设类最优假设记为<span class="math inline">\(f^{\circ}\)</span>,假设类最优误差也可以记为<span class="math inline">\(R_{\Phi}(f^{\circ})\)</span></p>
<ul>
<li>
<p>第一项差值被称为<strong>估计误差(estimation error)</strong>,它衡量了一个模型和假设类最优的模型之间的差距。</p>
</li>
<li>
<p>第二项差值是<strong>逼近误差(approximation error)</strong>,代表了假设类<span class="math inline">\(\mathcal{F}\)</span>在多大程度上涵盖了最优实值输出函数<span class="math inline">\(f^*\)</span>,刻画了模型的表达能力。</p>
</li>
</ul>
<p>接下来我们分别来看逼近误差和估计误差。</p>
<h1 id="6-逼近误差">6 逼近误差</h1>
<p>逼近误差<span class="math inline">\(\inf_{f\in\mathcal{F}}R_{\Phi}(f) - R_{\Phi}^{*}\)</span>是确定性的,它依赖于潜在的分布及函数类<span class="math inline">\(\mathcal{F}\)</span>:函数类越“大”,逼近误差越小。</p>
<p>界定逼近误差需要贝叶斯预测器<span class="math inline">\(f^*\)</span>的假设,因此也需要在测试分布上的假设。</p>
<p>在本部分,我们关注<span class="math inline">\(\mathcal{F}=\left\{f_{\theta}, \theta\in \Theta\right\}(\Theta\subset \mathbb{R}^d)\)</span>,以及凸Lipschitz连续的代理损失<span class="math inline">\(\Phi(\cdot)\)</span>,并假设<span class="math inline">\(\theta^*\)</span>是<span class="math inline">\(R_{\Phi}(f_{\theta})\)</span>在<span class="math inline">\(\theta \in \mathbb{R}^d\)</span>的最小点,假设该最小点存在(<span class="math inline">\(\theta^*\)</span>常常并不属于<span class="math inline">\(\Theta\)</span>)。这也就意味着逼近误差可以被分解为</p>
<p></p><div class="math display">\[\inf_{\theta\in\mathcal{\Theta}}R_{\Phi}(f_{\theta}) - R_{\Phi}^* = \left(\inf_{\theta\in\mathcal{\Theta}}R_{\Phi}(f_{\theta}) - \inf_{\theta\in\mathbb{R}^d}R_{\Phi}(f_{\theta})\right) + \left(\inf_{\theta\in\mathbb{R}^d}R_{\Phi}(f_{\theta}) - R_{\Phi}^{*}\right)
\]</div><p></p><ul>
<li>
<p>第二项<span class="math inline">\(\inf_{\theta\in\mathbb{R}^d}R_{\Phi}(f_{\theta}) - R_{\Phi}^{*}\)</span>是由所选择的模型<span class="math inline">\(f_{\theta}\)</span>的集合产生的不可压缩近似误差。对于诸如核方法或神经网络的灵活模型,这个不可压缩误差可以根据需要变小。</p>
</li>
<li>
<p>函数<span class="math inline">\(\theta\mapsto R_{\Phi}(f_{\theta}) - \inf_{\theta\in\mathbb{R}^d}R_{\Phi}(f_{\theta})\)</span>在<span class="math inline">\(\mathbb{R}^d\)</span>上非负,且通常可以被特定的范数<span class="math inline">\(\Omega(\theta - \theta^*)\)</span>(或其平方)界定,因此我们可以将上述的第一项<span class="math inline">\(\inf_{\theta\in\mathcal{\Theta}}R_{\Phi}(f_{\theta}) - \inf_{\theta\in\mathbb{R}^d}R_{\Phi}(f_{\theta})\)</span>看做是<span class="math inline">\(\theta^*\)</span>和<span class="math inline">\(\Theta\)</span>之间某种“距离”的体现。</p>
</li>
</ul>
<p>如果代理损失函数是<span class="math inline">\(G\)</span>-Lipschitz连续的,即存在常数<span class="math inline">\(G > 0\)</span>,对任意<span class="math inline">\(u_1, u_2 \in \mathbb{R}\)</span>,都有<span class="math inline">\(\left|\Phi(u_1) - \Phi(u_2)\right| \leqslant G\left|u_1 - u_2\right|\)</span>成立(参见我的博客《数值优化:算法分类及收敛性分析基础》), 我们有</p>
<p></p><div class="math display">\[\begin{aligned}
R_{\Phi}(f_{\theta_1}) - R_{\Phi}(f_{\theta_2}) &= \mathbb{E}_{x, y}\left[{\Phi}(yf_{\theta_1}(x)) - \Phi(yf_{\theta_2}(x))\right]\\
&\leqslant \mathbb{E}_{x, y}\left\\
&=G\mathbb{E}_{x, y}\left[|f_{\theta_1}(x) - f_{\theta_2}(x)|\right] \quad (y\in \left\{-1, +1\right\})
\end{aligned}
\]</div><p></p><p>因此,逼近误差的第一部分被<span class="math inline">\(G\)</span>和<span class="math inline">\(f_{\theta^{*}}\)</span>和<span class="math inline">\(\mathcal{F}=\left\{f_{\theta}, \theta\in \Theta\right\}\)</span>之间的某种“距离”之积所界定,这里是指某种伪距离<span class="math inline">\((\theta_1, \theta_2)\mapsto \mathbb{E}\left[|f_{\theta_1}(x) - f_{\theta_2}(x)|\right]\)</span>(忽略当且仅当<span class="math inline">\(\theta = \theta^{\prime}\)</span>时该式为0这一性质)。</p>
<h1 id="7-估计误差">7 估计误差</h1>
<p>机器学习理论的一大目标是最小化估计误差<span class="math inline">\(R_{\Phi}(\hat{f}) - \underset{f\in\mathcal{F}}{\inf} R_{\Phi}(f)\)</span>,然而我们前面提到过,<span class="math inline">\(\hat{f}\)</span>做为<span class="math inline">\(f^*\)</span>的估计值,是最小化经验误差来获得的,即<span class="math inline">\(\hat{f} = \underset{f\in \mathcal{F}}{\inf}\widehat{R}_{\Phi}(f)\)</span>。这也就意味着<span class="math inline">\(\hat{f}\)</span>并不能保证最小化泛化误差<span class="math inline">\(R_{\Phi}(\hat{f})\)</span>,自然也就无法保证最小化估计误差了。</p>
<p>于是,考虑对估计误差进行进一步的分解<sup></sup>与界定:</p>
<p></p><div class="math display">\[\begin{aligned}
& R_{\Phi}(\hat{f}) - \inf_{f\in\mathcal{F}}R_{\Phi}(f) \\
&= R_{\Phi}(\hat{f}) - R_{\Phi}(f^{\circ}) \quad (f^{\circ} = \underset{f\in \mathcal{F}}{\mathrm{arg min }}R_{\Phi}(f))\\
&= \underbrace{\left(R_{\Phi}(\hat{f}) - \widehat{R}_{\Phi}(\hat{f})\right)}_{\text{generalization gap}\geqslant 0} + \underbrace{\left(\widehat{R}_{\Phi}(\hat{f}) - \widehat{R}_{\Phi}(f^{\circ})\right)}_{\text{training error} \leqslant 0 \text{ or } \geqslant 0} + \underbrace{\left(\widehat{R}_{\Phi}(f^{\circ}) - R_{\Phi}(f^{\circ})\right)}_{\text{small term} \geqslant 0} \\
&\leqslant \sup_{f\in \mathcal{F}} \left(R_{\Phi}(f)- \widehat{R}_{\Phi}(f)\right) + \left(\widehat{R}_{\Phi}(\hat{f}) - \widehat{R}_{\Phi}(f^{\circ})\right) + \sup_{f\in \mathcal{F}} \left(\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right) \\
&\leqslant \sup_{f\in \mathcal{F}} \left(R_{\Phi}(f)- \widehat{R}_{\Phi}(f)\right) + 0 + \sup_{f\in \mathcal{F}} \left(\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right)\\
&\leqslant 2\sup_{f\in \mathcal{F}} \left|\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right|
\end{aligned}
\]</div><p></p><p>观察上式,我们可以发现:</p>
<ul>
<li>移除<span class="math inline">\(\widehat{R}_{\Phi}\)</span>和<span class="math inline">\(\hat{f}\)</span>之间统计依赖性的关键工具是取<strong>一致界(uniform bound)</strong>,也即上式中的<span class="math inline">\(\sup_{f\in \mathcal{F}}\left(\cdot\right)\)</span>操作。</li>
<li>当<span class="math inline">\(\hat{f}\)</span>并非<span class="math inline">\(\widehat{R}_{\Phi}\)</span>的全局最优点但满足<span class="math inline">\(\widehat{R}_{\Phi}(\hat{f})\leqslant \inf_{f\in\mathcal{F}}\widehat{R}_{\Phi}(f) + \epsilon\)</span>时,则优化误差<span class="math inline">\(\epsilon\)</span>需要被加入本部分所提到的估计误差中。</li>
<li><strong>渐进偏差(uniform deviation)</strong> 随着<span class="math inline">\(\mathcal{F}\)</span>的“大小”而增长,它是一个随机量(由于其依赖于数据),并常随着样本数量<span class="math inline">\(m\)</span>衰减。</li>
<li>这里的关键问题是我们需要对所有<span class="math inline">\(f\in \mathcal{F}\)</span>的一致控制:正如我们后面会涉及的,对于单个<span class="math inline">\(f\)</span>,可以对随机变量<span class="math inline">\(\Phi(yf(x))\)</span>应用集中不等式(如Hoeffding不等式)以得到<span class="math inline">\(\mathcal{O}(1 / \sqrt{m})\)</span>的界。然而,当控制在多个<span class="math inline">\(f\)</span>上的最大偏差时,会存在其中某个偏差变大的可能。因此,我们需要显式控制这种现象。</li>
</ul>
<h1 id="8-估计误差界">8 估计误差界</h1>
<h1 id="81-经验过程与mcdiarmid-不等式的应用">8.1 经验过程与McDiarmid 不等式的应用</h1>
<p>由于估计误差是一个随机量,我们需要使用概率工具来界定它,这时界可以以某个高概率成立。而<span class="math inline">\(\sup_{f\in \mathcal{F}}\lvert \widehat{R}_{\Phi}(f)- R_{\Phi}(f)\lvert\)</span>可以视为函数<span class="math inline">\(f\in \mathcal{F}\)</span>做为下标索引的随机过程的量纲(magnitude),这种随机过程称为经验过程(参见我的博客(《学习理论:代理损失函数的泛化界与Rademacher复杂度》)))。于是证明估计误差有界的问题可转化为证明经验过程有界的问题。</p>
<p>设<span class="math inline">\(A(z_1, \cdots, z_m) = \sup_{f\in \mathcal{F}}\left(\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right)\)</span>,其中随机变量<span class="math inline">\(z_i = (x_i, y_i)\)</span>独立同分布。假设对于所有的在数据生成分布的支撑集(support)里的<span class="math inline">\((x, y)\)</span>,以及<span class="math inline">\(f\in \mathcal{F}\)</span>,代理损失函数都位于0和某个<span class="math inline">\(M_{\Phi}\)</span>之间(对于大部分代理损失函数而言,这是当函数<span class="math inline">\(f\)</span>有界时的直接结果)。</p>
<blockquote>
<p><strong>注</strong> 直观地理解,概率分布<span class="math inline">\(p\)</span>的支撑集定义为那些概率大于0的点,对于离散概率分布而言为<span class="math inline">\(\left\{x: p(X=x) > 0\right\}\)</span>,对于连续概率分布而言为<span class="math inline">\(\left\{x: p(x) > 0\right\}\)</span><sup></sup>。</p>
</blockquote>
<p>对于单个函数<span class="math inline">\(f\in \mathcal{F}\)</span>,我们可以通过Hoeffding不等式控制<span class="math inline">\(\widehat{R}_{\Phi}(f)\)</span>与<span class="math inline">\(R_{\Phi}(f)\)</span>之间的偏差(这是一个有界独立随机变量的经验均值,与其期望之间的偏差):对于任意<span class="math inline">\(\delta\in (0, 1)\)</span>,至少以<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\leqslant \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log \frac{1}{\delta}}
\]</div><p></p><blockquote>
<p><strong>Hoeffding不等式:</strong> <sup></sup> 若<span class="math inline">\(X_1, \cdots, X_m\)</span>为独立随机变量且<span class="math inline">\(X_i\in \)</span>,则有</p>
<p></p><div class="math display">\ \geqslant \epsilon\right) \leqslant \exp\left(-\frac{2m\epsilon^2}{(b - a)^2}\right)
\]</div><p></p></blockquote>
<blockquote>
<p>有时也会用到Hoeffding不等式的另一种表达形式,令<span class="math inline">\(\delta = \exp\left(-\frac{2m\epsilon^2}{(b - a)^2}\right)\)</span>(<span class="math inline">\(\Rightarrow \epsilon = \frac{(b - a)}{\sqrt{2m}}\sqrt{\log\frac{1}{\delta}}\)</span>),则至少以<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[\frac{1}{m}\sum_{i=1}^m X_i- \frac{1}{m}\sum_{i=1}^m\mathbb{E} \leqslant \frac{(b - a)}{\sqrt{2m}}\sqrt{\log\frac{1}{\delta}}
\]</div><p></p></blockquote>
<p>这个控制可以扩展到多个函数<span class="math inline">\(f\)</span>。这里需要用到McDiarmid不等式,而该不等式的应用以有界变差条件为前提,在这里即当将单个变量<span class="math inline">\(z_i\in \mathcal{X}\times\mathcal{Y}\)</span>替换为与训练集独立同分布的<span class="math inline">\(z_i^{\prime}\in \mathcal{X}\times\mathcal{Y}\)</span>后,偏差最多为<span class="math inline">\(\frac{1}{m}M_{\Phi}\)</span>(也即<span class="math inline">\(\left|A(z_1, \cdots, z_m) - A(z_1, \cdots, z_i^{'}, \cdots, z_m)\right|\leqslant \frac{1}{m}M_{\Phi}\)</span>成立,类似的证明可以参见我的博客《学习理论:代理损失函数的泛化界与Rademacher复杂度》)。于是,应用<strong>McDiarmid不等式</strong>可以得到:至少以<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[ A(z_1, \cdots, z_m) \leqslant \mathbb{E}_{z}\left + \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log \frac{1}{\delta}}
\]</div><p></p><p>因此,要证明经验过程有界,可以先界定<span class="math inline">\(\sup_{f\in \mathcal{F}} \left(\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right)\)</span>的期望(或相似的量<span class="math inline">\(\sup_{f\in \mathcal{F}} \left(R_{\Phi}(f)- \widehat{R}_{\Phi}(f)\right)\)</span>的期望,它们通常具有同样的界),然后将单侧界转换为带<span class="math inline">\(\left|\cdot\right|\)</span>的双侧界即可。而这个期望的界定可以使用模型类的Rademacher复杂度(参见我之前的博客)。</p>
<h1 id="82-简化情形-有限数量的模型">8.2 简化情形: 有限数量的模型</h1>
<p>对于有限数量的模型,可以直接将对多个函数的处理转换为单个函数<span class="math inline">\(f\)</span>的处理。在这部分我们假设代理损失函数被界定在0和<span class="math inline">\(M_{\Phi}\)</span>之间。我们可以使用 <strong>联合界(union bound)</strong> 将一致偏差界定:</p>
<p></p><div class="math display">\[\mathbb{P}\left(\sup_{f\in \mathcal{F}} \left|\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right| \geqslant \epsilon\right) \leqslant \sum_{f\in \mathcal{F}} \mathbb{P}\left(\left|\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right| \geqslant \epsilon\right)
\]</div><p></p><p>因此,对于固定的<span class="math inline">\(f\in \mathcal{F}\)</span>,<span class="math inline">\(\widehat{R}_{\Phi}(f)\)</span>,可以应用双侧Hoeffding's不等式来界定每一个<span class="math inline">\(\mathbb{P}\left(\left|\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right| \geqslant \epsilon\right)\)</span>以得到</p>
<p></p><div class="math display">\[\begin{aligned}
\mathbb{P}\left(\sup_{f\in \mathcal{F}} \left|\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right|
\geqslant \epsilon\right) &\leqslant \sum_{f\in \mathcal{F}}2\exp(-2mt^2 / M_{\Phi}^2)\\
&= 2\left|\mathcal{F}\right|\exp(-2mt^2 / M_{\Phi}^2)
\end{aligned}
\]</div><p></p><blockquote>
<p><strong>双侧Hoeffding不等式:</strong> <sup></sup> 若<span class="math inline">\(X_1, \cdots, X_m\)</span>为独立随机变量且<span class="math inline">\(X_i\in \)</span>,则有</p>
<p></p><div class="math display">\\right| \geqslant \epsilon\right) \leqslant 2\exp\left(-\frac{2m\epsilon^2}{(b - a)^2}\right)
\]</div><p></p></blockquote>
<p>因此,令<span class="math inline">\(\delta = 2\left|\mathcal{F}\right|\exp(-2mt^2 / M_{\Phi}^2)\)</span>(<span class="math inline">\(\Rightarrow \epsilon = \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log \frac{2\left|\mathcal{F}\right|}{\delta}}\)</span>),得到至少以<span class="math inline">\(1 - \delta\)</span>的概率有:</p>
<p></p><div class="math display">\[\begin{aligned}
\sup_{f\in \mathcal{F}} \left|\widehat{R}_{\Phi}(f)- R_{\Phi}(f)\right| &\leqslant \epsilon = \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log \frac{2\left|\mathcal{F}\right|}{\delta}}\\
&= \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log(2\left|F\right|) + \log\frac{1}{\delta}}\\
&\leqslant M_{\Phi}\sqrt{\frac{\log(2\left|F\right|)}{2m}} + \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log \frac{1}{\delta}}\quad \left(\sqrt{a + b}\leqslant \sqrt{a} + \sqrt{b}\right)
\end{aligned}
\]</div><p></p><p>其中最后一行不等式我们使用了平方根函数的次可加性(sub-additivity)<sup></sup>。这是一个一致偏差的上界。根据上述界,当模型类“大小”的对数<span class="math inline">\(\log(\left|\mathcal{F}\right|)\)</span>显著小于<span class="math inline">\(m\)</span>时,学习是可能的。这是一个对一致偏差的通用控制方法。</p>
<blockquote>
<p><strong>注</strong> 这个上界对于有限的模型类成立。</p>
</blockquote>
<h1 id="83-通过覆盖数超越有限多的模型">8.3 通过覆盖数超越有限多的模型</h1>
<p>简单地说,<strong>覆盖数(covering number)</strong> <sup></sup>背后的思想是通过无限多的元素来处理函数空间,而这是通过使用有限多的元素来对其进行近似而达到的。这也常被称为“<span class="math inline">\(\epsilon\)</span>-网论证”。</p>
<p>这里我们假设代理损失函数是<span class="math inline">\(G\)</span>-Lipschitz连续的。因此,正像在第6部分提到过的,对任意<span class="math inline">\(f_1, f_2\in \mathcal{F}\)</span>,我们有</p>
<p></p><div class="math display">\[\left|R_{\Phi}(f_1) - R_{\Phi}(f_2)\right| \leqslant G\cdot \mathbb{E}\left[|f_1(x) - f_2(x)|\right] = G\cdot \Delta(f_1, f_2)
\]</div><p></p><p><strong>定义</strong>(覆盖数) 我们假设有<span class="math inline">\(N\)</span>个元素<span class="math inline">\(f_1, \cdots, f_N\)</span>使得对任意<span class="math inline">\(f\in \mathcal{F}\)</span>,存在<span class="math inline">\(i\in \left\{1, \cdots, N\right\}\)</span>,使得<span class="math inline">\(\Delta(f, f_i)\leqslant \epsilon\)</span>(<span class="math inline">\(d\)</span>按上式定义)。最小的<span class="math inline">\(N\)</span>的可能值是<span class="math inline">\(\mathcal{F}\)</span>在精度为<span class="math inline">\(\epsilon\)</span>时的<strong>覆盖数(Cover number)</strong>,记为<span class="math inline">\(\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\)</span>。下面展示了在二维的情况下使用欧几里得球的覆盖:</p>
<p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_251126095326_%E4%BA%8C%E7%BB%B4%E6%83%85%E5%86%B5%E4%B8%8B%E4%BD%BF%E7%94%A8%E6%AC%A7%E5%87%A0%E9%87%8C%E5%BE%97%E7%90%83%E7%9A%84%E8%A6%86%E7%9B%96.png" width="200" height="200" alt="" align="center">
</p>
<p>覆盖数<span class="math inline">\(\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\)</span>可视为<span class="math inline">\(\epsilon\)</span>的非增函数。通常情况下,当<span class="math inline">\(\epsilon\rightarrow 0\)</span>时,<span class="math inline">\(\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\)</span>随着<span class="math inline">\(\epsilon\)</span>以指数<span class="math inline">\(\epsilon^{-d}\)</span>变化,这里<span class="math inline">\(d\)</span>是潜在维度。实际上,对于<span class="math inline">\(f\)</span>关于<span class="math inline">\(M_{\Phi}\)</span>的有界条件而言,如果<span class="math inline">\(\mathcal{F}\)</span>(在某种特定的参数化下)被包含在维度为<span class="math inline">\(d\)</span>的<span class="math inline">\(M_{\Phi}\)</span>-球中的半径为<span class="math inline">\(c\)</span>的球中,它就可以被<span class="math inline">\((c/\epsilon)^d\)</span>个长度为<span class="math inline">\(2\epsilon\)</span>的立方体覆盖(若<span class="math inline">\(c \geqslant \epsilon\)</span>),如下图所示:</p>
<p align="center">
<img src="https://images.cnblogs.com/cnblogs_com/blogs/538207/galleries/2445565/o_251126100312_%E4%BD%BF%E7%94%A8%E7%AB%8B%E6%96%B9%E4%BD%93%E7%9A%84%E8%A6%86%E7%9B%96.png" width="200" height="200" alt="" align="center">
</p>
<p>推广到所有在<span class="math inline">\(d\)</span>维情况下等价的距离<span class="math inline">\(\Delta\)</span>,对于有限维向量空间的任意有界子集<span class="math inline">\(\mathcal{F}\)</span>,我们仍然有<span class="math inline">\(\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\sim \epsilon^{-d}\)</span>,取对数可得<span class="math inline">\(\log \mathcal{N}(\mathcal{F}, \Delta, \epsilon)\sim d\log\frac{1}{\epsilon}\)</span>。</p>
<p>对于某些集合<span class="math inline">\(\mathcal{F}\)</span>(例如在<span class="math inline">\(d\)</span>维情况下关于有界Lipschitz常量的Lipschitz连续函数),<span class="math inline">\(\log \mathcal{N}(\mathcal{F}, \Delta, \epsilon)\)</span>增长得更快,例如以<span class="math inline">\(\epsilon^{-d}\)</span>的阶增长。</p>
<p><strong><span class="math inline">\(\epsilon\)</span>-网论证</strong> 给定<span class="math inline">\(\mathcal{F}\)</span>的一个覆盖,对所有<span class="math inline">\(f\in \mathcal{F}\)</span>和与之相关的覆盖元素<span class="math inline">\((f_i)_{i\in 1, \cdots, \mathcal{N}(\mathcal{F}, \Delta, \epsilon)}\)</span>,由<span class="math inline">\(\widehat{R}\)</span>和<span class="math inline">\(R\)</span>都关于距离<span class="math inline">\(\Delta\)</span>满足<span class="math inline">\(G\)</span>-Lipschitz连续,于是对任意<span class="math inline">\(i\in \left\{1, \cdots, \mathcal{N}(\mathcal{F}, \Delta, \epsilon)\right\}\)</span>,我们有</p>
<p></p><div class="math display">\[\begin{aligned}
\left|\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right| &\leqslant \left|\widehat{R}_{\Phi}(f) - \widehat{R}_{\Phi}(f_i)\right| + \left|\widehat{R}_{\Phi}(f_i) - R_{\Phi}(f_i)\right| +\left|R_{\Phi}(f_i) - R_{\Phi}(f)\right| \\
&\leqslant 2G\cdot \Delta(f, f_i) + \left|\widehat{R}_{\Phi}(f_i) - R_{\Phi}(f_i)\right| \\
&\leqslant2G\cdot \Delta(f, f_i) + \sup_{j\in \left\{1, \cdots, \mathcal{N}(\mathcal{F}, \Delta, \epsilon)\right\}}\left|\widehat{R}_{\Phi}(f_j) - R_{\Phi}(f_j)\right| \\
\end{aligned}
\]</div><p></p><p>由于<span class="math inline">\((f_i)_{i\in 1, \cdots, \mathcal{N}(\mathcal{F}, \Delta, \epsilon)}\)</span>满足覆盖性质,于是我们可以用<span class="math inline">\(\epsilon\)</span>将<span class="math inline">\(\Delta(f, f_i)\)</span>界定:</p>
<p></p><div class="math display">\[\left|\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right| \leqslant 2G\epsilon +\sup_{j\in \left\{1, \cdots, \mathcal{N}(\mathcal{F}, \Delta, \epsilon)\right\}}\left|\widehat{R}_{\Phi}(f_j) - R_{\Phi}(f_j)\right|
\]</div><p></p><p>这也就意味着,若使用7.2部分的结论,则以大于<span class="math inline">\(1 - \delta\)</span>的概率有</p>
<p></p><div class="math display">\[\sup_{f\in \mathcal{F}}\left|\widehat{R}_{\Phi}(f) - R_{\Phi}(f)\right| \leqslant 2G\epsilon +M_{\Phi}\sqrt{\frac{\log\left(2\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\right)}{2m}} + \frac{M_{\Phi}}{\sqrt{2m}}\sqrt{\log\frac{1}{\delta}}
\]</div><p></p><p>因此,若<span class="math inline">\(\mathcal{N}(\mathcal{F}, \Delta, \epsilon)\sim \epsilon^{-d}\)</span>,忽略常数的话,我们还需要界定<span class="math inline">\(\epsilon + \sqrt{d\log(1 / \epsilon) / m}\)</span>。取<span class="math inline">\(\epsilon \propto 1 / \sqrt{m}\)</span>可以得到<span class="math inline">\(\mathcal{O}\left(\sqrt{(d/m)\log (m)}\right)\)</span>的率。和7.2部分有限模型类的情况有点像,这里和<span class="math inline">\(m\)</span>的关系也接近于<span class="math inline">\(1 / \sqrt{m}\)</span>。然而,除非重新定义覆盖数的计算或者使用更高级的工具(比如<strong>链(chaining)</strong>),这通常会导致非最优的维度项与/或样本数量项的出现。</p>
<p>有两个有力的工具可以以合理的代价导出更好的界,它们是<strong>Rademacher复杂度</strong>和<strong>Gaussian复杂度</strong>。其中Rademacher复杂度可以参见我的博客《学习理论:代理损失函数的泛化界与Rademacher复杂度》。</p>
<h1 id="参考">参考</h1>
<ul>
<li> Bach, Francis. Learning theory from first principles. MIT press, 2024.</li>
<li> 【COLT 2025】Conference on Learning Theory</li>
<li> Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine learning. MIT press, 2018.</li>
<li> John Duchi: Statistics and Information Theory</li>
<li> 周志华, 王魏, 高尉, 张利军. 机器学习理论导引. 机械工业出版社, 2020.</li>
<li> Han Bao: Learning Theory Bridges Loss Functions</li>
<li> Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.</li>
<li> Long, Phil, and Rocco Servedio. "Consistency versus realizable H-consistency for multiclass classification." International conference on machine learning. PMLR, 2013.</li>
<li> Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection. Advances in Neural Information Processing Systems, 2019, 32.</li>
<li> Wikipedia: Support (mathematics)</li>
<li> Vershynin R. High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018.</li>
<li> Wikipedia: Subadditivity</li>
</ul>
</div>
<div id="MySignature" role="contentinfo">
数学是符号的艺术,音乐是上界的语言。<br><br>
来源:https://www.cnblogs.com/orion-orion/p/19290276
頁:
[1]