A Rookie Ops Engineer Ventures into Machine Learning: Isolation Forest

星雨吻花, posted 2025-10-17 10:20:00


<h2 id="前言">Preface</h2>
<p>Isolation Forest is a fast and very efficient anomaly detection algorithm.</p>
<h2 id="开始探索">Getting started</h2>
<h4 id="scikit-learn">scikit-learn</h4>
<pre><code>import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

X_train = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-2, high=2, size=(10, 2))

clf = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto', random_state=rng)
clf.fit(X_train)

y_pred_train = clf.predict(X_train)
y_pred_outliers = clf.predict(X_outliers)

plt.title("Isolation Forest")
plt.scatter(X_train[:, 0], X_train[:, 1], color='b', label="Normal")
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='r', label="Outliers")

plt.legend()
plt.axis('tight')
plt.show()

</code></pre>
<p>Run the script:</p>
<p><img alt="watermarked-isolation_forest_1_4" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202510/1416773-20251017095648061-731780373.png" class="lazyload"></p>
<h2 id="深入理解">深入理解</h2>
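<p>An isolation forest resembles a random forest, but its trees do not use criteria such as information gain or the Gini index. Each tree picks a feature at random, picks a random split value between that feature's minimum and maximum, and divides the data into two parts. It then does the same inside each part, again choosing a random split value between that part's minimum and maximum, and keeps recursing until the specified depth is reached or the current node holds only one sample.</p>
<p>Build n such trees to form the forest. For each sample, take its path length in every tree (the depth of the leaf it lands in), average those path lengths across the trees to get E(x), and convert that average into an anomaly score s(x, n), where c(n) is the expected average path length for a sample of size n.</p>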
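<p>The splitting procedure can be sketched in plain Python. This is only an illustration, not sklearn's implementation: the helper names <code>path_length</code> and <code>mean_path_length</code> are made up for this sketch, the toy data is assumed, and instead of materialising each tree the code just follows the branch that contains the point (each call draws fresh random splits, so it stands in for one random tree). It also skips the correction that full implementations add when the depth limit stops a node early.</p>

```python
import random


def path_length(x, data, rng, depth=0, max_depth=10):
    """Depth at which point x is isolated by random axis-aligned splits."""
    # Stop when the node holds a single sample or the depth limit is reached.
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(x))          # pick a random feature
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                             # identical values cannot be split
        return depth
    split = rng.uniform(lo, hi)              # random cut between min and max
    # Keep only the side of the split that contains x.
    if x[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(x, side, rng, depth + 1, max_depth)


def mean_path_length(x, data, n_trees=200, seed=0):
    """Average the path length of x over n_trees independent random trees."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees


# Assumed toy data: five clustered points and one far-away point.
data = [(1.0,), (1.5,), (1.8,), (2.0,), (2.3,), (10.0,)]
far = mean_path_length((10.0,), data)
near = mean_path_length((1.8,), data)
print(f"mean path length of 10.0: {far:.2f}, of 1.8: {near:.2f}")
```

<p>The far-away point should come out with a clearly shorter average path than a point inside the cluster, which is the whole idea behind the score.</p>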
<p></p><div class="math display">\[\begin{cases}
c(n)=2H(n-1)-\frac{2(n-1)}{n} \\
H(n)=\sum_{i=1}^{n} \frac{1}{i} \\
s(x,n) = 2^{-\frac{E(x)}{c(n)}}
\end{cases}
\]</div><p></p><ul>
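<li><span class="math inline">\(s \approx 1\)</span>: a strong anomaly, isolated very easily</li>
<li><span class="math inline">\(0.5 \le s < 1\)</span>: possibly an anomaly; the closer to 1, the more anomalous. Deciding needs other parameters as well, such as the expected proportion of anomalies</li>
<li><span class="math inline">\(s < 0.5\)</span>: a normal point</li>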
</ul>
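<p>As a quick check of these ranges, the formulas can be written out directly. The function names below (<code>harmonic</code>, <code>c</code>, <code>anomaly_score</code>) and the choice of <code>n = 256</code> are assumptions for this sketch. It shows that the score is exactly 0.5 when the average path length equals c(n), and approaches 1 as the path gets shorter.</p>

```python
def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def c(n):
    """Normalising term c(n) = 2H(n-1) - 2(n-1)/n."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """s(x, n) = 2 ** (-E(x) / c(n))."""
    return 2.0 ** (-mean_depth / c(n))


n = 256  # assumed sample size, only for illustration
for mean_depth in (0.0, c(n) / 2, c(n), 2 * c(n)):
    print(f"E(x) = {mean_depth:6.2f} -> s = {anomaly_score(mean_depth, n):.4f}")
```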
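<h4 id="举例说明">A worked example</h4>
<p>Suppose we have the following samples: <code>[1.0, 1.5, 1.8, 2.0, 2.3, 10.0]</code></p>
<p>Build the first tree:</p>
<p>1) Level 1 (depth=1): pick a random split value in the range <code>1 < split < 10</code>, here <code>split = 5</code></p>
<ul>
<li>Left subtree: <code>[1.0, 1.5, 1.8, 2.0, 2.3]</code></li>
<li>Right subtree: <code>[10.0]</code></li>
</ul>
<p>2) Level 2 (depth=2): pick a random split value in the range <code>1 < split < 2.3</code>, here <code>split = 1.6</code></p>
<ul>
<li>Left subtree: <code>[1.0, 1.5]</code></li>
<li>Right subtree: <code>[1.8, 2.0, 2.3]</code></li>
</ul>
<p>3) Level 3 (depth=3), left subtree: pick a random split value in the range <code>1 < split < 1.5</code>, here <code>split = 1.2</code></p>
<ul>
<li>Left subtree: <code>[1.0]</code></li>
<li>Right subtree: <code>[1.5]</code></li>
</ul>
<p>4) Level 3 (depth=3), right subtree: pick a random split value in the range <code>1.8 < split < 2.3</code>, here <code>split = 2.1</code></p>
<ul>
<li>Left subtree: <code>[1.8, 2.0]</code></li>
<li>Right subtree: <code>[2.3]</code></li>
</ul>
<p>5) Level 4 (depth=4): pick a random split value in the range <code>1.8 < split < 2.0</code>, here <code>split = 1.9</code></p>
<ul>
<li>Left subtree: <code>[1.8]</code></li>
<li>Right subtree: <code>[2.0]</code></li>
</ul>
<p>6) Work out the path lengths</p>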
<p><img alt="watermarked-isolation_forest_1_1" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202510/1416773-20251017095637364-619966229.png" class="lazyload"></p>
<table>
<thead>
<tr>
<th>Sample value</th>
<th>Path length</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0</td>
<td>3</td>
</tr>
<tr>
<td>1.5</td>
<td>3</td>
</tr>
<tr>
<td>1.8</td>
<td>4</td>
</tr>
<tr>
<td>2.0</td>
<td>4</td>
</tr>
<tr>
<td>2.3</td>
<td>3</td>
</tr>
<tr>
<td>10.0</td>
<td>1</td>
</tr>
</tbody>
</table>
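<p>The path lengths in this table can be reproduced by replaying the hand-picked split values from steps 1) to 5). The function <code>isolate</code> is a made-up helper for this sketch, and it relies on a shortcut that only works for this example: at each node it uses the first listed split value that falls strictly inside the node's value range.</p>

```python
def isolate(samples, splits, depth=0, out=None):
    """Apply the chosen split values recursively and record each leaf's depth."""
    if out is None:
        out = {}
    if len(samples) == 1:
        out[samples[0]] = depth          # the sample is isolated at this depth
        return out
    lo, hi = min(samples), max(samples)
    # Shortcut for this example: use the hand-picked split inside (lo, hi).
    split = next(s for s in splits if lo < s < hi)
    isolate([v for v in samples if v < split], splits, depth + 1, out)
    isolate([v for v in samples if v >= split], splits, depth + 1, out)
    return out


samples = [1.0, 1.5, 1.8, 2.0, 2.3, 10.0]
splits = [5, 1.6, 1.2, 2.1, 1.9]         # split values from steps 1) to 5)
paths = isolate(samples, splits)
for value, depth in paths.items():
    print(f"{value:>4} -> path length {depth}")
```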
<p>Repeat the construction for every tree up to the n-th, record each path length, and take the average:</p>
<table>
<thead>
<tr>
<th>Sample value</th>
<th>Path in tree 1</th>
<th>Path in tree 2</th>
<th>Path in tree n</th>
<th>Average path</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0</td>
<td>3</td>
<td>3</td>
<td>..</td>
<td>3</td>
</tr>
<tr>
<td>1.5</td>
<td>3</td>
<td>3</td>
<td>..</td>
<td>3</td>
</tr>
<tr>
<td>1.8</td>
<td>4</td>
<td>4</td>
<td>..</td>
<td>4</td>
</tr>
<tr>
<td>2.0</td>
<td>4</td>
<td>4</td>
<td>..</td>
<td>4</td>
</tr>
<tr>
<td>2.3</td>
<td>3</td>
<td>3</td>
<td>..</td>
<td>3</td>
</tr>
<tr>
<td>10.0</td>
<td>1</td>
<td>1</td>
<td>..</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Compute the anomaly score:</p>
<p></p><div class="math display">\[s(x,n) = 2^{-\frac{E(x)}{c(n)}}
\]</div><p></p><p>1) Score for the sample 1.0:</p>
<ul>
<li>Average path length: <span class="math inline">\(E(x) = 3\)</span></li>
<li>Expected average path length for sample size <span class="math inline">\(n=6\)</span>:</li>
</ul>
<p></p><div class="math display">\[\begin{cases}
c(n)=2H(n-1)-\frac{2(n-1)}{n} \\
H(n)=\sum_{i=1}^{n} \frac{1}{i}
\end{cases}
\]</div><p></p><p></p><div class="math display">\[c(6)=2H(5)-\frac{2(6-1)}{6}=2\cdot\left(1+\frac{1}{2}+\frac{1}{3}+\frac{1}{4}+\frac{1}{5}\right) - \frac{2(6-1)}{6} = 2.9
\]</div><p></p><p></p><div class="math display">\[s(1.0) = 2^{-\frac{E(x)}{c(n)}} = 2^{-\frac{3}{2.9}} \approx 0.4882
\]</div><p></p><p>2) Scores for all samples</p>
<table>
<thead>
<tr>
<th>Sample value</th>
<th>Average path length</th>
<th>Anomaly score</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0</td>
<td>3</td>
<td>0.4882</td>
</tr>
<tr>
<td>1.5</td>
<td>3</td>
<td>0.4882</td>
</tr>
<tr>
<td>1.8</td>
<td>4</td>
<td>0.3844</td>
</tr>
<tr>
<td>2.0</td>
<td>4</td>
<td>0.3844</td>
</tr>
<tr>
<td>2.3</td>
<td>3</td>
<td>0.4882</td>
</tr>
<tr>
<td>10.0</td>
<td>1</td>
<td>0.7874</td>
</tr>
</tbody>
</table>
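<p>The scores in this table follow from the average path lengths and the formula for c(n). A small sketch that recomputes them (the dictionary <code>mean_path</code> simply copies the table's path lengths, and the variable names are chosen for this example):</p>

```python
def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, with H computed as an exact harmonic sum."""
    harmonic = sum(1.0 / i for i in range(1, n))   # H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


# Average path lengths taken from the table above.
mean_path = {1.0: 3, 1.5: 3, 1.8: 4, 2.0: 4, 2.3: 3, 10.0: 1}
n = len(mean_path)

scores = {x: 2.0 ** (-e / c(n)) for x, e in mean_path.items()}
print(f"c({n}) = {c(n):.4f}")
for x, s in scores.items():
    print(f"{x:>4} -> s = {s:.4f}")
```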
<hr>
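<p>Identifying the anomalies:</p>
<ul>
<li>The shorter the path, the more anomalous the sample. For example, 10.0 has a path length of 1: it was isolated by the very first split</li>
<li>The higher the anomaly score, the more likely the sample is an anomaly</li>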
</ul>
<h4 id="sklearn中的异常分数">Anomaly scores in sklearn</h4>
<pre><code>from sklearn.ensemble import IsolationForest
import numpy as np

# One feature per sample, so each sample is a one-element row.
X = np.array([[1.0], [1.5], [1.8], [2.0], [2.3], [10.0]])

clf = IsolationForest(random_state=0, contamination='auto')
clf.fit(X)

pred = clf.predict(X)
score = clf.decision_function(X)

for x, p, s in zip(X, pred, score):
    print(f"sample {x[0]:>4} -> {'anomaly' if p == -1 else 'normal'} | anomaly score (decision_function): {s:.4f}")

</code></pre>
<p>Run the script:</p>
<p><img alt="watermarked-isolation_forest_1_2" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202510/1416773-20251017095719324-1584615772.png" class="lazyload"></p>
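<p>This raises some questions:</p>
<ul>
<li>sklearn's scores do not match the hand-computed ones</li>
<li>Why was <code>1.0</code> flagged as an anomaly?</li>
<li>A lower score means more anomalous, not less</li>
</ul>
<p>Take the first question first: <code>sklearn's scores do not match the hand-computed ones</code>. First, each tree is fitted on a random subsample of size <code>max_samples</code>, not necessarily on all <code>n=6</code> samples (with <code>'auto'</code> the size is min(256, n_samples), so on this tiny data set it happens to be all 6), and every tree draws its own random splits, so the path lengths averaged over 100 trees will not match the single hand-built tree. Second, in the hand calculation above, the <span class="math inline">\(H(n)\)</span> inside the expected path length <span class="math inline">\(c(n)\)</span> is not actually computed with this formula:</p>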
<p></p><div class="math display">\[\begin{cases}
c(n)=2H(n-1)-\frac{2(n-1)}{n} \\
H(n)=\sum_{i=1}^{n} \frac{1}{i}
\end{cases}
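\]</div><p></p><p>As n grows, summing <span class="math inline">\(H(n)\)</span> term by term becomes expensive, so an approximation is normally used instead:</p>
<p></p><div class="math display">\[H(n) \approx \ln(n) + \gamma, \quad \gamma \approx 0.5772 \text{ (Euler's constant)}
\]</div><p></p><p>These two differences are why sklearn's anomaly scores do not match the hand calculation.</p>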
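<p>To get a feel for how much the approximation matters, the exact harmonic sum can be compared with <code>ln(n) + γ</code>. The function names below are chosen for this sketch. The gap is roughly 1/(2n), so it is noticeable for a data set as small as the six-sample example and negligible for a few hundred samples.</p>

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, rounded


def harmonic_exact(n):
    """H(n) as an explicit sum of reciprocals."""
    return sum(1.0 / i for i in range(1, n + 1))


def harmonic_approx(n):
    """H(n) approximated by ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA


for n in (5, 50, 255):
    exact = harmonic_exact(n)
    approx = harmonic_approx(n)
    print(f"n = {n:>3}: exact = {exact:.4f}, approx = {approx:.4f}, gap = {exact - approx:.4f}")
```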
<p>Now the second question: why was <code>1.0</code> flagged as an anomaly?</p>
<p>Changing a single parameter, <code>contamination=0.1</code>, resolves it:</p>
<p><img alt="watermarked-isolation_forest_1_3" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202510/1416773-20251017095726394-2048782056.png" class="lazyload"></p>
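<p><code>contamination</code> sets the expected proportion of anomalies. With <code>auto</code>, sklearn does not assume a proportion but uses a fixed threshold (<code>offset_ = -0.5</code>); in this run 2 of the 6 samples, i.e. 33.3%, fell on the anomalous side of it. Setting the parameter to 0.1 tells the model to expect about 10% anomalies, which for 6 samples means a single point, so only the most abnormal sample, <code>10.0</code>, is flagged.</p>
<p>Finally, the third question: a lower score means more anomalous. This clearly comes from the way the score is computed, so let's read the source directly, version <code>scikit-learn:1.6.1</code>:</p>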
<ul>
<li>
<p>The <code>decision_function</code> method</p>
<pre><code>def decision_function(self, X):
   return self.score_samples(X) - self.offset_
</code></pre>
</li>
<li>
<p>The <code>score_samples</code> method returns the negative of the score computed by <span class="math inline">\(s(x,n) = 2^{-\frac{E(x)}{c(n)}}\)</span></p>
<pre><code>def score_samples(self, X):
   ...
   return self._score_samples(X)

def _score_samples(self, X):
   return -self._compute_chunked_score_samples(X)

def _compute_chunked_score_samples(self, X):
   ...

   for sl in slices:
         # compute score on the slices of test samples:
         scores = self._compute_score_samples(X, subsample_features)

   return scores

def _compute_score_samples(self, X, subsample_features):
   ...
   scores = 2 ** (
         # For a single training sample, denominator and depth are 0.
         # Therefore, we set the score manually to 1.
         -np.divide(
            depths, denominator, out=np.ones_like(depths), where=denominator != 0
         )
   )
   return scores
</code></pre>
</li>
<li>
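<p><code>self.offset_</code> is computed from the scores of all training samples: it is the percentile of those scores given by the anomaly proportion <code>contamination</code> (not their median). This applies when <code>contamination</code> is a number; with <code>'auto'</code> the offset is fixed at -0.5</p>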
<pre><code>    self.offset_ = np.percentile(self._score_samples(X), 100.0 * self.contamination)

</code></pre>
</li>
</ul>
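<p>Put together, the three pieces amount to "negate the score, subtract an offset, flag anything below zero". The sketch below mimics that in plain Python. It is an assumption-laden illustration, not sklearn code: <code>percentile</code> and <code>classify</code> are made-up helpers, the percentile uses linear interpolation (NumPy's default), and the input scores are the hand-computed values from the worked example, not the scores sklearn actually produces, so the labels sklearn prints can differ.</p>

```python
def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


samples = [1.0, 1.5, 1.8, 2.0, 2.3, 10.0]
# Hand-computed s(x, n) from the worked example (not sklearn's values).
paper_scores = [0.4882, 0.4882, 0.3844, 0.3844, 0.4882, 0.7874]
# Sign flipped as in score_samples: lower now means more abnormal.
score_samples = [-s for s in paper_scores]


def classify(offset):
    """decision = score_samples - offset; negative decisions are anomalies (-1)."""
    decision = [s - offset for s in score_samples]
    return [-1 if d < 0 else 1 for d in decision]


auto_labels = classify(-0.5)                                   # contamination='auto'
ten_percent_labels = classify(percentile(score_samples, 10))   # contamination=0.1
for x, a, t in zip(samples, auto_labels, ten_percent_labels):
    print(f"{x:>4}: auto -> {a:>2}, contamination=0.1 -> {t:>2}")
```

<p>Because the sign is flipped before the offset is subtracted, the most easily isolated sample ends up with the most negative decision value, which is why lower means more anomalous.</p>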
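<p>At this point all I can say is: it is complicated. After this much processing, the result is bound to differ from the hand calculation.</p>
<h4 id="小结">Summary</h4>
<p>In sklearn:</p>
<ul>
<li>For finding isolated points, <code>contamination</code> is a very important parameter: it sets the offset subtracted from every sample's score, and therefore decides which samples count as anomalies</li>
<li>To find isolated points quickly, call <code>predict</code>: <code>-1</code> marks an isolated point and <code>1</code> a normal point</li>
<li>To get a score for each point, call <code>decision_function</code>; unlike the theoretical formula, a lower score means more anomalous</li>
</ul>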
<h2 id="小结-1">Wrap-up</h2>
<ul>
<li>Get in touch with me for a deeper discussion<br>
<img alt="" width="500" height="200" loading="lazy" src="https://img2024.cnblogs.com/blog/1416773/202411/1416773-20241121135740959-1907948957.png#" class="lazyload"></li>
</ul>
<hr>
<p>That is the end of this article.<br>
My knowledge is limited, so if I have slipped up anywhere, corrections are very welcome...</p>

</div>
<div id="MySignature" role="contentinfo">
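<p>This article comes from cnblogs (博客园), author: it排球君. When reposting, please cite the original link: https://www.cnblogs.com/MrVolleyball/p/19147189</p>
<div>Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, the original link must be given on the article page; otherwise the right to pursue legal liability is reserved. </div><br><br>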