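A Rookie Ops Engineer Braves Machine Learning: Gradient Descent
<h2 id="前言">Preface</h2><p>In this installment of "A Rookie Ops Engineer Braves Machine Learning" we look at gradient descent.</p>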
<h2 id="梯度">Gradient</h2>
<p>To understand what a gradient is, we have to start with the derivative.</p>
<h4 id="导数">Derivative</h4>
<p>When the independent variable <span class="math inline">\(x\)</span> of a function <span class="math inline">\(y=f(x)\)</span> receives an increment <span class="math inline">\(\Delta x\)</span> at a point <span class="math inline">\(x_0\)</span>, the output changes by <span class="math inline">\(\Delta y=f(x_0 + \Delta x)-f(x_0)\)</span>. If the ratio of <span class="math inline">\(\Delta y\)</span> to <span class="math inline">\(\Delta x\)</span> has a limit <span class="math inline">\(a\)</span> as <span class="math inline">\(\Delta x\)</span> tends to 0, then <span class="math inline">\(a\)</span> is the derivative of <span class="math inline">\(f\)</span> at <span class="math inline">\(x_0\)</span>.</p>
<p></p><div class="math display">\[f'(x) = \frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} = a
\]</div><p></p><p><img alt="gradient_1" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250917110850489-1176985183.png" class="lazyload"></p>
<p>(Image from Baidu Baike)</p>
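<p>The limit definition above can be approximated numerically by plugging in a small but nonzero increment. The following is only a minimal sketch: the helper name <code>numerical_derivative</code>, the test function <code>square</code> and the step size <code>dx=1e-6</code> are illustrative assumptions, not part of the original article.</p>

```python
def numerical_derivative(f, x0, dx=1e-6):
    """Approximate f'(x0) with the difference quotient (f(x0 + dx) - f(x0)) / dx."""
    return (f(x0 + dx) - f(x0)) / dx


def square(x):
    # Illustrative function f(x) = x^2, whose exact derivative is 2x
    return x * x


approx = numerical_derivative(square, 3.0)
print(approx)  # close to the exact derivative 2 * 3 = 6
```

<p>A smaller <code>dx</code> brings the quotient closer to the limit until floating-point rounding starts to dominate, which is why a moderate value is used here.</p>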
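<h4 id="偏导数">Partial derivative</h4>
<p>A partial derivative is essentially the same thing as a derivative, except that it handles functions of several variables: one variable changes while all the others are held fixed.</p>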
<p></p><div class="math display">\[\frac{\partial f}{\partial x_i} = \lim_{\Delta x_i \to 0} \frac{f(x_1, \dots, x_i + \Delta x_i, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{\Delta x_i}
\]</div><p></p><p>Take a function of two variables <span class="math inline">\(f(x,y)\)</span> as an example.</p>
<p>Partial derivative with respect to <span class="math inline">\(x\)</span>:</p>
<p></p><div class="math display">\[\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}
\]</div><p></p><p>Partial derivative with respect to <span class="math inline">\(y\)</span>:</p>
<p></p><div class="math display">\[\frac{\partial f}{\partial y} = \lim_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}
\]</div><p></p><p><img alt="gradient_2" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250917110924720-1621633974.png" class="lazyload"></p>
<p>(Image from Baidu Baike)</p>
<p>Functions of more than two variables can no longer be drawn this way.</p>
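<p>The two partial-derivative formulas for a function of two variables can be checked numerically in the same way, by moving one variable and holding the other fixed. This is a sketch under assumptions: the function <code>g(x, y) = x^2 + 3xy</code> and the helper names are made up for illustration.</p>

```python
def partial_x(f, x, y, h=1e-6):
    """Difference quotient in x while y is held fixed."""
    return (f(x + h, y) - f(x, y)) / h


def partial_y(f, x, y, h=1e-6):
    """Difference quotient in y while x is held fixed."""
    return (f(x, y + h) - f(x, y)) / h


def g(x, y):
    # Illustrative function; exact partials are 2x + 3y and 3x
    return x ** 2 + 3 * x * y


dg_dx = partial_x(g, 1.0, 2.0)  # exact value: 2*1 + 3*2 = 8
dg_dy = partial_y(g, 1.0, 2.0)  # exact value: 3*1 = 3
print(dg_dx, dg_dy)
```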
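<h4 id="方向导数">Directional derivative</h4>
<p>Derivatives and partial derivatives measure the rate of change along a coordinate axis (for example <span class="math inline">\(x\)</span> along the x-axis, <span class="math inline">\(y\)</span> along the y-axis). A directional derivative measures the rate of change along any direction freely chosen within the domain.</p>
<p>Given a multivariate function <span class="math inline">\(f\)</span> and a unit direction vector <span class="math inline">\(u\)</span>, the directional derivative <span class="math inline">\(D_uf\)</span> is the rate of change of <span class="math inline">\(f\)</span> in the direction <span class="math inline">\(u\)</span>.</p>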
<p></p><div class="math display">\[D_{\mathbf{u}}f(a) = \lim_{h \to 0} \frac{f(a + hu) - f(a)}{h}
\]</div><p></p><p>For a function of two variables:</p>
<p></p><div class="math display">\[D_{\mathbf{u}}f(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + h u_1, y_0 + h u_2) - f(x_0, y_0)}{h}
\]</div><p></p><ul>
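<li><span class="math inline">\(u_1, u_2\)</span> are the components of the direction <span class="math inline">\(u = (u_1, u_2)\)</span></li>
<li><span class="math inline">\(h\)</span> is the displacement along the direction <span class="math inline">\(u\)</span></li>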
</ul>
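<p>The two-variable formula above translates almost literally into code. In this sketch the function <code>paraboloid</code>, the point <code>(1, 2)</code> and the direction <code>(3, 4)</code> are illustrative assumptions; the direction is normalized first so that <code>h</code> really is a displacement along a unit vector.</p>

```python
import math


def directional_derivative(f, point, direction, h=1e-6):
    """Approximate D_u f at `point`, where u is `direction` scaled to unit length."""
    x0, y0 = point
    norm = math.hypot(direction[0], direction[1])
    u1, u2 = direction[0] / norm, direction[1] / norm
    return (f(x0 + h * u1, y0 + h * u2) - f(x0, y0)) / h


def paraboloid(x, y):
    # Illustrative function f(x, y) = x^2 + y^2
    return x ** 2 + y ** 2


# Direction (3, 4) normalizes to u = (0.6, 0.8)
d = directional_derivative(paraboloid, (1.0, 2.0), (3.0, 4.0))
print(d)  # close to 2*1*0.6 + 2*2*0.8 = 4.4
```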
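<h4 id="梯度-1">Gradient</h4>
<p>The gradient of a multivariate function at a point is the vector made up of all its partial derivatives at that point. It gives the direction in which the function changes fastest at that point, together with the corresponding rate of change. For <span class="math inline">\(f(x_1,x_2,...,x_n)\)</span>, the gradient is written <span class="math inline">\(\nabla f\)</span>.</p>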
<p></p><div class="math display">\[\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right)
\]</div><p></p><ul>
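<li>Direction: the gradient points in the direction in which the function increases fastest at that point</li>
<li>Magnitude: the norm of the gradient is the maximum rate of change, attained in that direction</li>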
</ul>
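<p>Directional derivative: the dot product of the gradient with a unit direction vector <span class="math inline">\(u\)</span></p>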
<p></p><div class="math display">\[D_uf=\nabla f⋅u
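\]</div><p></p><p>When <span class="math inline">\(u\)</span> points in the same direction as <span class="math inline">\(\nabla f\)</span>, the directional derivative is largest and equals <span class="math inline">\(\lVert \nabla f \rVert\)</span><br>
When <span class="math inline">\(u\)</span> points in the opposite direction to <span class="math inline">\(\nabla f\)</span>, the directional derivative is smallest and equals <span class="math inline">\(-\lVert \nabla f \rVert\)</span></p>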
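<p>The claim that the dot product is largest along the gradient and smallest against it can be checked by brute force over many unit directions. This is a sketch only: the function <code>f(x, y) = x^2 + y^2</code>, the point <code>(1, 2)</code> and the 0.1 degree angular grid are assumptions chosen for illustration.</p>

```python
import math


def grad_paraboloid(x, y):
    # Analytic gradient of the illustrative function f(x, y) = x^2 + y^2
    return (2 * x, 2 * y)


gx, gy = grad_paraboloid(1.0, 2.0)
grad_norm = math.hypot(gx, gy)
grad_angle = math.atan2(gy, gx)

best_angle, best_value = None, -math.inf
worst_value = math.inf
for k in range(3600):  # unit vectors every 0.1 degree
    angle = 2 * math.pi * k / 3600
    u = (math.cos(angle), math.sin(angle))
    value = gx * u[0] + gy * u[1]  # D_u f = grad f . u
    if value > best_value:
        best_angle, best_value = angle, value
    worst_value = min(worst_value, value)

print(best_value, grad_norm)    # the maximum is (nearly) the gradient norm
print(best_angle, grad_angle)   # reached (nearly) at the gradient's own angle
print(worst_value, -grad_norm)  # the minimum is (nearly) minus the gradient norm
```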
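<h4 id="小结">Summary</h4>
<p>Gradient and directional derivative: the gradient points in the direction in which the directional derivative is largest, and the norm (magnitude) of the gradient equals that largest directional derivative<br>
Directional derivative and partial derivative: a partial derivative is a special case of the directional derivative in which <span class="math inline">\(u\)</span> is a unit vector along a coordinate axis, for example <span class="math inline">\(u=(1,0)\)</span> gives <span class="math inline">\(\partial f/\partial x\)</span> and <span class="math inline">\(u=(0,1)\)</span> gives <span class="math inline">\(\partial f/\partial y\)</span>. In short, a directional derivative taken along a coordinate axis is a partial derivative</p>
<p>Now that we know where the gradient comes from and what it means, we can finally turn to the topic of this article: gradient descent.</p>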
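<h2 id="梯度下降法">Gradient descent</h2>
<p>In regression tasks, an important metric for evaluating a model is the loss function MSE, and fitting the model comes down to making the MSE as small as possible.</p>
<p>As described above, the gradient is the direction in which a function increases fastest, so gradient descent repeatedly moves along the negative gradient to search for the minimum of the MSE. Unlike the closed-form least-squares solution, which only applies to linear models, gradient descent works for most models, including linear regression, logistic regression and so on.</p>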
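<h4 id="核心思想">Core idea</h4>
<ul>
<li>The gradient of a function points in the direction in which the function value increases fastest, so the negative gradient is the direction in which it decreases fastest</li>
<li>By repeatedly adjusting the parameters along the negative gradient, we move step by step toward a minimum of the function</li>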
</ul>
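<h4 id="步骤">Steps</h4>
<ul>
<li>Initialization: choose the initial parameters at random, or set them all to 0</li>
<li>Iterative update: every iteration updates the parameter values: $$\theta_{t+1}=\theta_t - \eta \cdot \nabla f(\theta_t)$$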
<ul>
<li><span class="math inline">\(\theta_t\)</span>: the parameter values at iteration <span class="math inline">\(t\)</span></li>
<li><span class="math inline">\(\eta\)</span>: the size of each update, also called the learning rate</li>
<li><span class="math inline">\(\nabla f(\theta_t)\)</span>: the gradient of the objective function <span class="math inline">\(f\)</span> at <span class="math inline">\(\theta_t\)</span></li>
</ul>
</li>
<li>Stopping conditions:
<ul>
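<li>Small gradient: the norm of the gradient is very small, typically below <span class="math inline">\(10^{-6}\)</span></li>
<li>Small change in the loss: the loss changes very little between two iterations, typically by less than <span class="math inline">\(10^{-6}\)</span></li>
<li>The maximum number of iterations is reached</li>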
</ul>
</li>
</ul>
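<p>The steps above can be sketched as a small generic routine. This is an assumption-laden illustration rather than a reference implementation: the function name <code>gradient_descent</code>, the quadratic test objective and the defaults <code>lr=0.1</code>, <code>tol=1e-6</code>, <code>max_iter=10_000</code> are all made up here, and only two of the three stopping conditions (gradient norm and maximum iterations) are implemented to keep it short.</p>

```python
import math


def gradient_descent(grad, theta0, lr=0.1, tol=1e-6, max_iter=10_000):
    """Repeat theta <- theta - lr * grad(theta).

    Stops when the gradient norm drops below `tol` or after `max_iter` updates.
    Returns the final parameters and the number of updates performed.
    """
    theta = list(theta0)
    for step in range(max_iter):
        g = grad(theta)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return theta, step
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta, max_iter


def grad_bowl(theta):
    # Gradient of the illustrative objective f(a, b) = (a - 3)^2 + (b + 1)^2,
    # whose minimum is at (3, -1)
    a, b = theta
    return [2 * (a - 3), 2 * (b + 1)]


theta, steps = gradient_descent(grad_bowl, [0.0, 0.0])
print(theta, steps)  # theta ends up close to [3, -1]
```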
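<h4 id="计算过程">Worked calculation</h4>
<p>Using the data from the first installment of this series (simple linear regression), let us walk through gradient descent in detail.</p>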
<pre><code>data = {
'result': ,
'feature1':
}
</code></pre>
<p>The goal is to find the set of parameters that minimizes the loss function MSE:</p>
<p></p><div class="math display">\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat {y}_i)^2
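\]</div><p></p><p>Substitute the model prediction <span class="math inline">\(y_i=\beta_0 + \beta_1 x_i\)</span>. Note that in this article's notation <span class="math inline">\(\hat {y}_i\)</span> stands for the observed target value of sample <span class="math inline">\(i\)</span>.</p>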
<p></p><div class="math display">\[f(\beta_0 , \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - \hat {y}_i)^2
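\]</div><p></p><p>First compute the gradient by taking the partial derivatives with respect to <span class="math inline">\(\beta_0\)</span> and <span class="math inline">\(\beta_1\)</span>.</p>
<p>Partial derivative with respect to <span class="math inline">\(\beta_0\)</span> (the inner derivative is 1):</p>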
<p></p><div class="math display">\[\frac{\partial f}{\partial β_0} = \frac{1}{n} \sum_{i=1}^{n} 2(β_0 + β_1x_i - \hat {y}_i)⋅(β_0 + β_1x_i - \hat {y}_i)' = \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i)
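\]</div><p></p><p>Then the partial derivative with respect to <span class="math inline">\(\beta_1\)</span> (the inner derivative is <span class="math inline">\(x_i\)</span>):</p>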
<p></p><div class="math display">\[\frac{\partial f}{\partial β_1} = \frac{1}{n} \sum_{i=1}^{n} 2(β_0 + β_1x_i - \hat {y}_i)⋅(β_0 + β_1x_i - \hat {y}_i)' = \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i)⋅x_i
\]</div><p></p><p>This gives the gradient:</p>
<p></p><div class="math display">\[\nabla f = (\frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i), \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i)⋅x_i)
\]</div><p></p><p>Set the hyperparameters: learning rate <span class="math inline">\(\eta=0.001\)</span> and 100 iterations. The iterations then proceed as follows.</p>
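<p>The MSE and the gradient derived above can be written as two short functions and cross-checked against a finite-difference approximation. This is a sketch under assumptions: the names <code>mse</code> and <code>mse_gradient</code> are illustrative, and the four data points are a hypothetical toy set, not the article's data.</p>

```python
def mse(beta0, beta1, xs, ys):
    """Mean squared error of the line beta0 + beta1 * x against targets ys."""
    n = len(xs)
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradient(beta0, beta1, xs, ys):
    """Analytic gradient (d/d beta0, d/d beta1) of the MSE, as derived above."""
    n = len(xs)
    residuals = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
    d_beta0 = 2 / n * sum(residuals)
    d_beta1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_beta0, d_beta1


# Hypothetical toy data, NOT the data set used in the article
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

g0, g1 = mse_gradient(0.0, 0.0, xs, ys)

# Finite-difference check of both components at (0, 0)
h = 1e-6
fd0 = (mse(h, 0.0, xs, ys) - mse(0.0, 0.0, xs, ys)) / h
fd1 = (mse(0.0, h, xs, ys) - mse(0.0, 0.0, xs, ys)) / h
print(g0, fd0)
print(g1, fd1)
```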
<p>1) First iteration: initialize <span class="math inline">\(\beta_0\)</span> and <span class="math inline">\(\beta_1\)</span> to 0</p>
<p>Compute the loss:</p>
<p></p><div class="math display">\[\begin{aligned}
MSE &= f(\beta_0 , \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - \hat {y}_i)^2 \\
&= \frac{1}{8}[(0-0.63)^2+(0-0.72)^2+...+(0-0.47)^2] = 0.35965
\end{aligned}
\]</div><p></p><p>Compute the gradient:</p>
<p></p><div class="math display">\[\begin{aligned}
\frac{\partial f}{\partial β_0} &= \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i) \\
&= \frac{1}{4}[(0-0.63)+(0-0.72)+...+(0-0.47)] = -1.185
\end{aligned}
\]</div><p></p><p></p><div class="math display">\[\begin{aligned}
\frac{\partial f}{\partial β_1} &= \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i)⋅x_i \\
&= \frac{1}{4}[(0-0.63)·22.48+(0-0.72)·19.50+...+(0-0.47)·13.24] = -20.418025
\end{aligned}
\]</div><p></p><p></p><div class="math display">\[\nabla f = (-1.185, -20.418025)
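\]</div><p></p><p>The gradient norm is far above <span class="math inline">\(10^{-6}\)</span>, so no stopping condition is met; continue with the second iteration.</p>
<p>2) Second iteration: first update <span class="math inline">\(\beta_0\)</span> and <span class="math inline">\(\beta_1\)</span></p>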
<p></p><div class="math display">\[\beta_0 ← \beta_0 - \eta · \frac{\partial f}{\partial β_0} = 0 - 0.001·(-1.185) = 0.001185
\]</div><p></p><p></p><div class="math display">\[\beta_1 ← \beta_1 - \eta · \frac{\partial f}{\partial β_1} = 0 - 0.001·(-20.418025) = 0.020418025
\]</div><p></p><p>Compute the loss:</p>
<p></p><div class="math display">\[MSE = f(\beta_0 , \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - \hat {y}_i)^2 = 0.064501
\]</div><p></p><p>Compute the gradient:</p>
<p></p><div class="math display">\[\frac{\partial f}{\partial β_0} = \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i) = -0.492909
\]</div><p></p><p></p><div class="math display">\[\frac{\partial f}{\partial β_1} = \frac{2}{n} \sum_{i=1}^{n} (β_0 + β_1x_i - \hat {y}_i)⋅x_i = -8.395268
\]</div><p></p><p></p><div class="math display">\[\nabla f = (-0.492909, -8.395268)
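\]</div><p></p><p>The loss changed from 0.35965 to 0.064501 and the gradient norm is still far above <span class="math inline">\(10^{-6}\)</span>, so no stopping condition is met; continue with the third iteration...</p>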
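<p>The update arithmetic of the two rounds above can be replayed in a few lines, using only the learning rate and the gradient values quoted in the text. The helper name <code>update</code> is an illustrative assumption, and the third-round parameters printed at the end are not stated in the article; they simply follow from applying the same rule to the second-round gradient.</p>

```python
eta = 0.001

# Gradient values quoted in the worked example above
grad_round1 = (-1.185, -20.418025)
grad_round2 = (-0.492909, -8.395268)


def update(params, grad, lr):
    """One gradient descent step: param <- param - lr * gradient component."""
    return tuple(p - lr * g for p, g in zip(params, grad))


params_round1 = (0.0, 0.0)  # initial beta0, beta1
params_round2 = update(params_round1, grad_round1, eta)
params_round3 = update(params_round2, grad_round2, eta)

print(params_round2)  # approximately (0.001185, 0.020418025)
print(params_round3)  # approximately (0.001677909, 0.028813293)
```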
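<p>The iterations continue in this way until a stopping condition is met. The <span class="math inline">\(β_0\)</span> and <span class="math inline">\(β_1\)</span> of that final iteration are then taken as the fitted parameters.</p>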
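<p>Putting the whole procedure together, a full loop might look like the sketch below. It uses the article's settings (<span class="math inline">\(\eta=0.001\)</span>, at most 100 iterations, tolerance <span class="math inline">\(10^{-6}\)</span>) but runs on a hypothetical data set, since the original data is not reproduced here; the function name <code>fit_line</code> and the sample values are assumptions for illustration only.</p>

```python
import math


def fit_line(xs, ys, lr=0.001, max_iter=100, tol=1e-6):
    """Fit y = beta0 + beta1 * x by gradient descent on the MSE.

    Returns beta0, beta1 and the loss recorded at each iteration.
    """
    n = len(xs)
    beta0, beta1 = 0.0, 0.0  # initialize both parameters to 0
    history = []
    for _ in range(max_iter):
        residuals = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n
        g0 = 2 / n * sum(residuals)
        g1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        history.append(loss)
        # Stopping condition 1: gradient norm is tiny
        if math.hypot(g0, g1) < tol:
            break
        # Stopping condition 2: loss barely changed since the last iteration
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break
        beta0 -= lr * g0
        beta1 -= lr * g1
    return beta0, beta1, history


# Hypothetical data, NOT the article's data set
xs = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
ys = [0.40, 0.46, 0.52, 0.58, 0.64, 0.70]

beta0, beta1, history = fit_line(xs, ys)
print(beta0, beta1)
print(len(history), history[0], history[-1])
```

<p>Printing the loss history is a cheap way to see which stopping condition fired. A loop like this can stop on the loss-change test while the intercept is still moving slowly, so a low loss at the stopping point does not by itself mean the parameters have fully converged.</p>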
<h2 id="联系我">Contact me</h2>
<ul>
<li>Get in touch for a more in-depth discussion</li>
</ul>
<p><img alt="" width="500" height="200" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202411/1416773-20241121135740959-1907948957.png#" class="lazyload"></p>
<hr>
<p>That concludes this article<br>
My knowledge is limited; if anything here is sloppy or wrong, corrections are very welcome...</p>
</div>
<div id="MySignature" role="contentinfo">
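<p>This article is from cnblogs (博客园), author: it排球君. Please cite the original link when reposting: https://www.cnblogs.com/MrVolleyball/p/19096369</p>
<div>The copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, a link to the original must be given on the article page; otherwise the right to pursue legal liability is reserved. </div><br><br>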