A Rookie Ops Engineer Ventures into Machine Learning: Logistic Regression
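<h2 id="前言">Preface</h2><p>With this installment, our machine learning journey enters a new chapter. The previous posts covered regression algorithms, which are mainly used to predict numeric values. This one turns to classification, which in short means sorting data into categories according to some rule.</p>
<p>The algorithm discussed here, logistic regression, has "regression" in its name but actually solves classification problems.</p>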
<h2 id="开始探索">Getting started</h2>
<h4 id="scikit-learn">scikit-learn</h4>
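<p>As usual, we start with an example and discuss the theory afterwards.</p>
<p>Consider this scenario: a man wants to find out how much smoking his wife will tolerate, so he runs the following experiment over one week.</p>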
<table>
<thead>
<tr>
<th></th>
<th>Monday</th>
<th>Tuesday</th>
<th>Wednesday</th>
<th>Thursday</th>
<th>Friday</th>
<th>Saturday</th>
<th>Sunday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cigarettes smoked</td>
<td>6</td>
<td>18</td>
<td>14</td>
<td>13</td>
<td>5</td>
<td>10</td>
<td>8</td>
</tr>
<tr>
<td>Hit by his wife?</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody>
</table>
<p>Feed this data into the model:</p>
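<pre><code>from sklearn.linear_model import LogisticRegression
import numpy as np
# Cigarettes per day (Monday to Sunday) and whether he was hit (1 = yes, 0 = no)
X = np.array([6, 18, 14, 13, 5, 10, 8]).reshape(-1, 1)
y = np.array([0, 1, 1, 1, 0, 1, 0])
model = LogisticRegression()
model.fit(X, y)
# coef_ has shape (1, 1) and intercept_ has shape (1,), so index out the scalars
w = model.coef_[0][0]
b = model.intercept_[0]
print(f"Coefficient: {w:.4f}")
print(f"Intercept: {b:.4f}")
# The predicted probability is exactly 0.5 where w * x + b = 0
decision_boundary = -b / w
print(f"Decision boundary: {decision_boundary:.2f}")
</code></pre>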
<p>Run the script:</p>
<p><img alt="watermarked-logistic_regression_1_1" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140003063-1414962719.png" class="lazyload"></p>
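<h4 id="报告解读">Reading the output</h4>
<p>A single feature determines the outcome, so this is clearly a linear model, which is why the familiar coefficient and intercept show up again. There is also one new quantity, the decision boundary: 9.1 is the cutoff on the feature, so inputs at or above 9.1 are classified as 1 and inputs below 9.1 as 0.</p>
<p>Back in the scenario: more than 9 cigarettes in a day gets him hit, and 9 or fewer does not.</p>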
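<h2 id="深入理解逻辑回归">Understanding logistic regression in depth</h2>
<h4 id="与线性回归比较">Comparison with linear regression</h4>
<p>You may object that this looks a lot like linear regression, with only a twist at the end. The differences are:</p>
<ul>
<li>Logistic regression takes the output of a linear model, maps it through a function to a probability between 0 and 1, and then classifies based on that probability.</li>
<li>Linear regression uses MSE as its loss function, whereas logistic regression uses the mean cross-entropy.</li>
<li>Linear regression coefficients can be found with least squares or with gradient-based methods (not covered in earlier posts). Logistic regression has no closed-form solution, so it has to be fitted iteratively with gradient-based methods.</li>
<li>There are many other differences too, including but not limited to model evaluation, use cases, and the objective function.</li>
</ul>
<p>In short, although logistic regression has "regression" in its name, it is really suited to classification problems.</p>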
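<p>To make the "gradient-based" point concrete, here is a minimal sketch that fits the one-feature smoking example with plain batch gradient descent on the mean cross-entropy. The learning rate, the iteration count, the feature centering, and the names <code>fit_logistic_gd</code>, <code>lr</code>, and <code>n_iter</code> are all illustrative assumptions. This is not what scikit-learn does internally (its default solver is L-BFGS with L2 regularization), so the fitted numbers will differ from the script output shown earlier.</p>

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic_gd(xs, ys, lr=0.1, n_iter=20000):
    """Fit p = sigmoid(w * x + b) by batch gradient descent on mean cross-entropy."""
    m = len(xs)
    mean_x = sum(xs) / m
    xc = [x - mean_x for x in xs]  # centering keeps the steps well-conditioned
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient of the mean cross-entropy: (prediction - label) times the input
        errs = [sigmoid(w * x + b) - y for x, y in zip(xc, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xc)) / m
        grad_b = sum(errs) / m
        w -= lr * grad_w
        b -= lr * grad_b
    # Undo the centering: w * (x - mean_x) + b = w * x + (b - w * mean_x)
    return w, b - w * mean_x

cigarettes = [6, 18, 14, 13, 5, 10, 8]
hit = [0, 1, 1, 1, 0, 1, 0]
w, b = fit_logistic_gd(cigarettes, hit)
boundary = -b / w
print(f"w={w:.3f}, b={b:.3f}, decision boundary={boundary:.2f}")
```

<p>Because this toy data is perfectly separable and the sketch has no regularization, the weight keeps growing the longer it runs; only the boundary settles, somewhere between 8 and 10 cigarettes.</p>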
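<h4 id="数学模型">Mathematical model</h4>
<p>Logistic regression maps the output of a linear model to a probability between 0 and 1 using the sigmoid function (also called the logistic function), and classifies based on that probability:</p>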
<p></p><div class="math display">\[\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \quad , z = \mathbf{w}^\top \mathbf{x} + b
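\]</div><p></p><p><span class="math inline">\(\mathbf{w}\)</span> is the weight vector, <span class="math inline">\(b\)</span> is the bias term, and <span class="math inline">\(\mathbf{x}\)</span> is the input feature vector.</p>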
<p></p><div class="math display">\[z \to \infty,\sigma(z) \to 1
\]</div><p></p><p></p><div class="math display">\[z \to -\infty,\sigma(z) \to 0
\]</div><p></p><p>The sigmoid thus maps the output of the linear function from <span class="math inline">\((-\infty,+\infty)\)</span> into the probability range <span class="math inline">\((0, 1)\)</span>.</p>
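<p>A minimal sketch of the sigmoid in plain Python shows this limiting behavior numerically. The function name and the branch used for numerical stability are illustrative choices, not taken from the original script.</p>

```python
import math

def sigmoid(z):
    """Map any real z into the range between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula to avoid overflow in exp(-z)
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-10, -1, 0, 1, 10):
    print(f"z={z:>3}  sigmoid(z)={sigmoid(z):.5f}")
```

<p>Large negative inputs land near 0, large positive inputs near 1, and <code>z = 0</code> gives exactly 0.5, which is why 0.5 is the natural cutoff later on.</p>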
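<h4 id="损失函数">Loss function</h4>
<p>Unlike linear regression, which uses MSE, logistic regression uses the mean cross-entropy (log loss) as its loss function:</p>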
<p></p><div class="math display">\[\mathcal{L} = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
\]</div><p></p><pre><code>from sklearn.metrics import log_loss
y_proba = model.predict_proba(X)[:, 1]
loss_sklearn = log_loss(y, y_proba)
print('=='*20)
print(f"Loss (log loss): {loss_sklearn:.4f}")
</code></pre>
<p><img alt="watermarked-logistic_regression_1_5" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140057482-1235839835.png" class="lazyload"></p>
<ul>
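<li>A value close to 0 means the predicted probabilities are close to the true labels.</li>
<li>The larger the value, the more wrong or uncertain the predicted probabilities are.</li>
<li>The value tends to <span class="math inline">\(+\infty\)</span> for extreme errors (for example, predicting 1 with near certainty when the truth is 0).</li>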
</ul>
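<p>The three cases can be checked by computing the formula by hand. The sketch below is a plain-Python version of the mean cross-entropy. The name <code>mean_cross_entropy</code>, the clipping constant <code>eps</code>, and the toy probabilities are assumptions for illustration, not the internals of <code>log_loss</code>.</p>

```python
import math

def mean_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy, following the formula above."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = mean_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.2])
undecided = mean_cross_entropy(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = mean_cross_entropy(y_true, [0.1, 0.9, 0.2, 0.8])
print(f"confident and right: {confident_right:.4f}")
print(f"undecided:           {undecided:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```

<p>The undecided case comes out at ln 2, roughly 0.69, which is a useful reference point: a loss above that is worse than always answering 0.5.</p>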
<h4 id="模型评估">Model evaluation</h4>
<ul>
<li>
<p>Accuracy: as the name suggests, the fraction of samples classified correctly.</p>
<pre><code>from sklearn.metrics import accuracy_score
y_pred = model.predict(X)
accuracy = accuracy_score(y, y_pred)
print('=='*20)
print(f"Accuracy: {accuracy:.2f}")
</code></pre>
<p><img alt="watermarked-logistic_regression_1_1" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140003063-1414962719.png" class="lazyload"></p>
</li>
<li>
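<p>Confusion matrix: for a binary classification problem (one whose outcome can be labeled 0 or 1), the confusion matrix is a 2×2 matrix containing the following four counts:</p>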
<ul>
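<li>True positives (TP): positive samples the model correctly predicts as positive. In the example, days correctly predicted as "hit".</li>
<li>False negatives (FN): positive samples the model wrongly predicts as negative (misses). In the example, "hit" days wrongly predicted as "not hit".</li>
<li>False positives (FP): negative samples the model wrongly predicts as positive (false alarms). In the example, "not hit" days wrongly predicted as "hit".</li>
<li>True negatives (TN): negative samples the model correctly predicts as negative. In the example, days correctly predicted as "not hit".</li>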
</ul>
<pre><code>[[3 1]   # TN=3, FP=1
 [1 3]]  # FN=1, TP=3
</code></pre>
<pre><code>from sklearn.metrics import confusion_matrix
print('=='*20)
print('Confusion matrix:')
y_pred = model.predict(X)
cm = confusion_matrix(y, y_pred)
print(cm)
</code></pre>
<p><img alt="watermarked-logistic_regression_1_3" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140227051-420257751.png" class="lazyload"></p>
<p>A whole family of evaluation metrics is derived from the confusion matrix:</p>
<ul>
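<li>Accuracy: the fraction of predictions that are correct <span class="math inline">\(\frac{TP+TN}{TP+TN+FP+FN}\)</span></li>
<li>Precision: among samples predicted positive, the fraction that are truly positive <span class="math inline">\(\frac{TP}{TP+FP}\)</span></li>
<li>Recall: among truly positive samples, the fraction predicted correctly <span class="math inline">\(\frac{TP}{TP+FN}\)</span></li>
<li>Specificity: among truly negative samples, the fraction predicted correctly <span class="math inline">\(\frac{TN}{TN+FP}\)</span></li>
<li>F1 score: the harmonic mean of precision and recall <span class="math inline">\(2\cdot\frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}\)</span></li>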
</ul>
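<p>As a quick check of these formulas, the sketch below derives all five metrics from the four counts of the illustrative matrix above (TN=3, FP=1, FN=1, TP=3). The helper name <code>binary_metrics</code> is hypothetical, and returning 0.0 when a denominator is zero is a choice made here for convenience.</p>

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive the standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Counts from the illustrative matrix above: TN=3, FP=1, FN=1, TP=3
metrics = binary_metrics(tn=3, fp=1, fn=1, tp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```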
<p><img alt="watermarked-logistic_regression_1_4" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140235506-2127187459.png" class="lazyload"></p>
<p>Or simply use <code>classification_report</code>:</p>
<pre><code>from sklearn.metrics import classification_report
print('=='*20)
y_pred = model.predict(X)
print("Logistic Regression classification report:\n", classification_report(y, y_pred))
</code></pre>
<p><img alt="watermarked-logistic_regression_1_10" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140243297-710023540.png" class="lazyload"></p>
</li>
<li>
<p>ROC-AUC</p>
<ul>
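<li>The ROC (receiver operating characteristic) curve and the AUC (area under the curve) are widely used when classes are imbalanced, meaning the classes differ greatly in sample count. For example, out of 1,000,000 log lines, 99.9% are normal and only 0.1% are anomalous.</li>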
</ul>
<pre><code>from sklearn.metrics import roc_curve, roc_auc_score
y_proba = model.predict_proba(X)[:, 1]
auc_score = roc_auc_score(y, y_proba)
print('=='*20)
print(f"AUC = {auc_score:.4f}")
</code></pre>
<p><img alt="watermarked-logistic_regression_1_6" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140254688-1089175319.png" class="lazyload"></p>
<ul>
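<li>The closer the AUC is to 1, the better the model ranks positive samples above negative ones. A value around 0.5 means it is no better than random guessing.</li>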
</ul>
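<p>One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it that way on a toy example. The function name <code>auc_by_pairs</code> and the scores are illustrative, and the quadratic pair loop is meant for reading, not for large datasets.</p>

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))
```

<p>Here three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so the result is 0.75.</p>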
<pre><code>import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y, y_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
</code></pre>
<p><img alt="watermarked-logistic_regression_1_7" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140304802-40588448.png" class="lazyload"></p>
<p>Let's just hand the curve to GPT for an interpretation:</p>
<p><img alt="watermarked-logistic_regression_1_8" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140311148-55374057.png" class="lazyload"></p>
</li>
</ul>
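<h2 id="多特征下的逻辑回归">Logistic regression with multiple features</h2>
<h4 id="决策边界">Decision boundary</h4>
<p>First, a closer look at the decision boundary. It is obtained by fitting the coefficients and the intercept and then substituting them into the model:</p>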
<p></p><div class="math display">\[\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \quad , z = \mathbf{w}^\top \mathbf{x} + b
\]</div><p></p><p>With a single feature:</p>
<p></p><div class="math display">\[\hat{y} = \sigma(w_1x_1+b) = \frac{1}{1 + e^{-(w_1x_1+b)}} \quad
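\]</div><p></p><p>Take 0.5 as the classification threshold. Why 0.5? In most binary problems the two outcomes <code>0</code> and <code>1</code> are treated as equally likely, so a probability above 0.5 is usually read as <code>1</code> and a probability below 0.5 as <code>0</code>. Under class imbalance that choice has to change, which is discussed later; for now we stick with 0.5.</p>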
<p></p><div class="math display">\[\frac{1}{1 + e^{-(w_1x_1+b)}} = 0.5 \quad
\]</div><p></p><p></p><div class="math display">\[e^{-(w_1x_1+b)} = 1 \quad
\]</div><p></p><p></p><div class="math display">\[-(w_1x_1+b) = 0 \quad
\]</div><p></p><p></p><div class="math display">\[x_1 = -\frac{b}{w_1} \quad
\]</div><p></p><p>So with one feature the decision boundary is a single point, which makes it very easy to separate <code>0</code> from <code>1</code>.</p>
<p>With two features:</p>
<p></p><div class="math display">\[\hat{y} = \sigma(w_1x_1+w_2x_2+b) = \frac{1}{1 + e^{-(w_1x_1+w_2x_2+b)}} \quad
\]</div><p></p><p>Likewise, set <span class="math inline">\(\hat{y}=0.5\)</span>:</p>
<p></p><div class="math display">\[\frac{1}{1 + e^{-(w_1x_1+w_2x_2+b)}} = 0.5 \quad
\]</div><p></p><p></p><div class="math display">\[x_2=-\frac{w_1x_1+b}{w_2}
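\]</div><p></p><p>So with two features the decision boundary is a straight line in the <span class="math inline">\((x_1, x_2)\)</span> plane, with slope <span class="math inline">\(-w_1/w_2\)</span> and intercept <span class="math inline">\(-b/w_2\)</span>.</p>
<p>By the same reasoning, three features give a plane, and with more than three features the boundary is a hyperplane that can no longer be drawn.</p>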
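<p>The two-feature derivation can be verified numerically: any point computed from the boundary formula should give a probability of exactly 0.5. The weights in this sketch are hypothetical values picked for round numbers, not fitted ones, and <code>boundary_x2</code> is an illustrative helper name.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(x1, w1, w2, b):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

# Hypothetical weights, chosen only for illustration
w1, w2, b = 1.0, 2.0, -4.0

for x1 in (0.0, 1.0, 2.0):
    x2 = boundary_x2(x1, w1, w2, b)
    p = sigmoid(w1 * x1 + w2 * x2 + b)
    print(f"x1={x1}, x2={x2}, probability on the boundary={p}")
```

<p>With a positive <code>w2</code>, moving a point upward from the line raises the probability above 0.5 and moving it downward lowers it, so the line splits the plane into a <code>1</code> side and a <code>0</code> side.</p>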
<h4 id="2个特征">Two features</h4>
<p>Continuing the earlier scenario, suppose drinking is recorded alongside smoking, giving two features:</p>
<table>
<thead>
<tr>
<th></th>
<th>Monday</th>
<th>Tuesday</th>
<th>Wednesday</th>
<th>Thursday</th>
<th>Friday</th>
<th>Saturday</th>
<th>Sunday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cigarettes smoked</td>
<td>6</td>
<td>18</td>
<td>14</td>
<td>13</td>
<td>5</td>
<td>10</td>
<td>8</td>
</tr>
<tr>
<td>Alcohol (in liang, about 50 g each)</td>
<td>8</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>3</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<td>Hit by his wife?</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
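<pre><code>from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
# Each row is one day: [cigarettes, alcohol in liang]
X = np.array([
    [6, 8],
    [18, 1],
    [14, 2],
    [13, 4],
    [5, 3],
    [10, 3],
    [8, 0],
])
# 1 = hit, 0 = not hit
y = np.array([1, 0, 0, 1, 0, 1, 1])
model = LogisticRegression()
model.fit(X, y)
coef = model.coef_[0]            # [w1, w2]
intercept = model.intercept_[0]  # b
print(f"Coefficients: {coef}")
print(f"Intercept: {intercept}")
</code></pre>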
<p><img alt="watermarked-logistic_regression_1_9" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140323575-30138938.png" class="lazyload"></p>
<p>Decision boundary, with x the number of cigarettes and y the amount of alcohol: $$ y=\frac{0.127x-0.94}{0.26} $$</p>
<pre><code>import matplotlib.pyplot as plt
x_vals = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100)
# Solve w1 * x1 + w2 * x2 + b = 0 for x2
w1, w2 = model.coef_[0]
b = model.intercept_[0]
decision_boundary = -(w1 * x_vals + b) / w2
plt.figure(figsize=(8, 6))
# Red = class 0 (not hit), blue = class 1 (hit)
colors = ['red' if label == 0 else 'blue' for label in y]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=80, edgecolor='k')
plt.plot(x_vals, decision_boundary, 'k--', label='Decision Boundary')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
</code></pre>
<p><img alt="watermarked-logistic_regression_1_11" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140332521-352693555.png" class="lazyload"></p>
<p>Points above the boundary are classified as 1, and points below it as 0.</p>
<h2 id="类别不平衡">Class imbalance</h2>
<p>Take the following code: of 1000 samples, only 14 are <code>1</code> and 986 are <code>0</code>, which is a severe class imbalance.</p>
<pre><code>from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
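# Assumption: the weights value is missing from the source; [0.99, 0.01] is a
# guess that is consistent with the 986/14 split described above
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.99, 0.01], flip_y=0.01,
                           class_sep=0.5, random_state=0)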
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
c_report = classification_report(y_test, y_pred, zero_division=0)
print("Logistic Regression classification report:\n", c_report)
</code></pre>
<p><img alt="watermarked-logistic_regression_1_12" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140344470-1043489583.png" class="lazyload"></p>
<ul>
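<li>precision: the model fails completely on the minority class <code>1</code>. The 99% precision on the majority class <code>0</code> is meaningless, because not a single sample was correctly predicted as <code>1</code>.</li>
<li>recall: every true <code>0</code> sample was found (100%), but not one <code>1</code> sample was.</li>
<li>f1-score: the F1 of class <code>1</code> is 0, meaning the model's ability to predict the minority class has collapsed entirely.</li>
<li>support: class <code>0</code> has 296 samples, class <code>1</code> only 4.</li>
<li>accuracy: 0.99; the model got 296 predictions right and 4 wrong.</li>
<li>macro avg: the plain average of the per-class metrics, ignoring class sizes.</li>
<li>weighted avg: the average of the per-class metrics weighted by the number of samples in each class.</li>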
</ul>
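<p>A reader might ask: the data was split into training and test sets only once. With cross-validation, which splits the data several times and gives the model more chances to learn the minority class, would things improve?</p>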
<pre><code>from sklearn.model_selection import StratifiedKFold, cross_val_predict
# Stratified folds keep the class ratio roughly the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)
c_report = classification_report(y, y_pred, zero_division=0)
print("Logistic Regression (cross-validation) classification report:\n", c_report)
</code></pre>
<p><img alt="watermarked-logistic_regression_1_13" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140356268-1313377023.png" class="lazyload"></p>
<p>No improvement: the model still cannot pick out the minority class.</p>
<h4 id="权重调整">Adjusting class weights</h4>
<pre><code>model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
c_report = classification_report(y_test, y_pred, zero_division=0)
print("Logistic Regression (class-weighted) classification report:\n", c_report)
</code></pre>
<p><img alt="watermarked-logistic_regression_1_14" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140404085-32436292.png" class="lazyload"></p>
<p>Things improve somewhat:</p>
<ul>
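<li>Recall for <code>1</code> rose from 0 to 0.5: 1 of the 2 positive samples in the test split was found.</li>
<li>Precision for <code>1</code> rose from 0 to 0.01: most samples predicted as positive are wrong. This is the effect of class_weight, which makes the model prefer guessing positive even at the risk of being wrong.</li>
<li>Recall for <code>0</code> fell from 1 to 0.7, also because of class_weight: some truly negative samples are now misclassified as positive.</li>
<li>Accuracy fell from 99% to 70%: overall correctness drops, but the model is now willing to predict the minority class at all.</li>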
</ul>
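<p>For reference, scikit-learn documents the <code>'balanced'</code> mode as <code>n_samples / (n_classes * np.bincount(y))</code>. The sketch below applies that formula to the label counts of the full dataset (986 and 14). Note that the script above fits on <code>X_train</code> only, so the weights it actually uses will differ slightly. The helper name <code>balanced_class_weights</code> is hypothetical.</p>

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Label counts of the full dataset described above: 986 zeros and 14 ones
labels = [0] * 986 + [1] * 14
weights = balanced_class_weights(labels)
for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.4f}")
```

<p>Each minority sample ends up counting roughly 70 times as much as a majority sample, which is why the model starts predicting <code>1</code> so readily.</p>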
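<h4 id="过采样">Oversampling</h4>
<p>Oversampling adds minority-class samples, either by duplicating existing ones or by generating new ones. Here SMOTE (Synthetic Minority Over-sampling Technique) is used:</p>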
<pre><code>from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
# imblearn's Pipeline applies the sampler only while fitting, so the
# held-out part of each fold is predicted without being resampled
model = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('logreg', LogisticRegression(solver='lbfgs', max_iter=1000))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)
print("SMOTE + LogisticRegression classification report:\n")
print(classification_report(y, y_pred, zero_division=0))
</code></pre>
<p><img alt="watermarked-logistic_regression_1_15" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140413116-1206725104.png" class="lazyload"></p>
<ul>
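<li>Recall rose to 0.64: the model now identifies a larger share of the minority class.</li>
<li>Precision is 0.04, still poor.</li>
<li>Accuracy is 0.75, up from 0.70 with class weighting, as more of the minority class is now identified.</li>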
</ul>
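<p>The core idea of SMOTE is to create each synthetic sample on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step in plain Python. It is a simplified illustration, not imblearn's implementation, and the name <code>smote_like</code>, the value of <code>k</code>, and the toy points are assumptions.</p>

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a near minority neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base (squared Euclidean distance), excluding base itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - c) ** 2 for a, c in zip(p, base)))
        neighbor = rng.choice(others[:k])
        lam = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + lam * (c - a) for a, c in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_like(minority, n_new=3)
for point in new_points:
    print(point)
```

<p>Because every synthetic point lies between two real minority samples, oversampling this way fills in the minority region rather than stacking exact duplicates.</p>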
<h4 id="欠采样">Undersampling</h4>
<p>Undersampling reduces the number of majority-class samples (by random deletion or by clustering). Here RandomUnderSampler is used:</p>
<pre><code>from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
# Randomly drop majority-class samples from each training fold before fitting
pipeline = Pipeline([
    ('undersample', RandomUnderSampler(random_state=0)),
    ('logreg', LogisticRegression(solver='lbfgs', max_iter=1000))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(pipeline, X, y, cv=cv)
print("Undersampling + LogisticRegression classification report:\n")
print(classification_report(y, y_pred, zero_division=0))
</code></pre>
<p><img alt="watermarked-logistic_regression_1_16" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140420293-593614614.png" class="lazyload"></p>
<p>The result is much the same as with oversampling, only slightly worse.</p>
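<p>For intuition, random undersampling amounts to keeping every minority sample and a random subset of the majority class of the same size. The sketch below shows that idea in plain Python on toy data. It is not RandomUnderSampler's implementation, and the helper name <code>random_undersample</code> is hypothetical.</p>

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    minority, majority = sorted((idx0, idx1), key=len)
    # Keep all minority rows plus an equally sized random sample of majority rows
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 8 samples of class 0 and 2 of class 1
X_toy = [[i] for i in range(10)]
y_toy = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
X_res, y_res = random_undersample(X_toy, y_toy)
print(y_res.count(0), y_res.count(1))
```

<p>The obvious cost is that most of the majority data is thrown away, which may be part of why it does slightly worse than oversampling here.</p>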
<h4 id="正则化">Regularization</h4>
<p>Lasso (L1) and Ridge (L2) regularization can still be used here:</p>
<pre><code>from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
# penalty='l1' is the Lasso-style penalty; liblinear is one of the solvers that supports it
pipeline = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('lasso', LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, random_state=0))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(pipeline, X, y, cv=cv)
print("SMOTE + Lasso Logistic Regression (L1) classification report:\n")
print(classification_report(y, y_pred, zero_division=0))
</code></pre>
<p><img alt="watermarked-logistic_regression_1_17" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140428101-255118588.png" class="lazyload"></p>
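<h4 id="代价敏感学习">Cost-sensitive learning</h4>
<p>This is really another adjustment of the same kind, except that the <code>class_weight</code> hyperparameter is set by hand for finer control:</p>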
<pre><code>from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
# Errors on class 1 are penalized 50 times as heavily as errors on class 0
pipeline = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('logreg', LogisticRegression(class_weight={0: 1, 1: 50}))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(pipeline, X, y, cv=cv)
print("class_weight {0:1, 1:50} classification report:\n")
print(classification_report(y, y_pred, zero_division=0))
</code></pre>
<p>Meaning of <code>class_weight={0: 1, 1: 50}</code>:</p>
<ul>
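<li>Class 0 (the majority class) has weight 1 (the standard penalty).</li>
<li>Class 1 (the minority class) has weight 50 (a misprediction is penalized far more heavily).</li>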
</ul>
<p><img alt="watermarked-logistic_regression_1_18" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202509/1416773-20250903140435652-631503619.png" class="lazyload"></p>
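<p>This approach sacrifices accuracy in order to miss as few minority-class samples as possible, so precision for the minority class <code>1</code> is very low while its recall is very high. In other words: better to wrongly flag a thousand than to let a single one slip through.</p>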
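<p>To see why the weights push the model in that direction, the sketch below scales each sample's cross-entropy term by the weight of its true class, which is in essence how class weights enter the loss. The function name <code>weighted_cross_entropy</code> and the two toy predictions are illustrative assumptions.</p>

```python
import math

def weighted_cross_entropy(y_true, y_prob, class_weight):
    """Sum of per-sample cross-entropy terms, each scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += class_weight[y] * loss
    return total

class_weight = {0: 1, 1: 50}
# A missed positive (true 1, predicted 0.1) versus a false alarm (true 0, predicted 0.9)
missed_positive = weighted_cross_entropy([1], [0.1], class_weight)
false_alarm = weighted_cross_entropy([0], [0.9], class_weight)
print(f"missed positive: {missed_positive:.2f}")
print(f"false alarm:     {false_alarm:.2f}")
```

<p>Both mistakes are equally wrong in probability terms, yet the missed positive costs 50 times as much, so the optimizer accepts many false alarms to avoid a single miss.</p>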
<h4 id="小结">Summary</h4>
<p>In logistic regression, there are usually two ways to approach class imbalance:</p>
<ul>
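<li>One prefers false alarms over misses: find the minority class first, then verify the flagged samples further. Examples include intrusion detection screening and code vulnerability detection.</li>
<li>The other cares more about the majority class and accepts that some minority-class samples will be misclassified. Examples include spam classification and the accuracy of recommendation systems.</li>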
</ul>
<hr>
<p>That concludes this article.<br>
My knowledge is limited; if I have gotten anything wrong, corrections are very welcome.</p>
</div>
<div id="MySignature" role="contentinfo">
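<p>This article comes from cnblogs (博客园), author: it排球君. When reposting, please cite the original link: https://www.cnblogs.com/MrVolleyball/p/19071731</p>
<div>Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, a link to the original must be given on the article page; otherwise the right to pursue legal liability is reserved. </div><br><br>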