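Crash Analysis of a .NET Industrial-Control PCB Inspection System
<h2 id="一背景">I. Background</h2><h3 id="1-讲故事">1. The story</h3>
<p>A few days ago a student from my training camp reached out: their system was crashing, and he had gone through it himself without finding the cause, so he asked me to take a look. Once I had the dump in hand, the next step was to open it in WinDbg.</p>
<h2 id="二崩溃分析">II. Crash analysis</h2>
<h3 id="1-为什么会崩溃">1. Why did it crash</h3>
<p>When the dump is opened, WinDbg automatically positions itself at the crash point and prints the following:</p>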
<pre><code class="language-C#">
................................................................
................................................................
.........................................
Loading unloaded module list
...........................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(1cec.1984): Access violation - code c0000005 (first/second chance not available)
+------------------------------------------------------------------------+
| This target supports Hardware-enforced Stack Protection. A HW based |
| "Shadow Stack" may be available to assist in debugging and analysis. |
| See aka.ms/userhsp for more info. |
| |
| dps @ssp |
| |
+------------------------------------------------------------------------+
For analysis of this file, run !analyze -v
clr!WKS::gc_heap::find_first_object+0xea:
00007ff9`9faea3eb 833800 cmp dword ptr [rax],0 ds:00000461`0000085a=????????
</code></pre>
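<p>Judging from the <code>find_first_object</code> function in the output, the GC hit an invalid address while looking for an object to mark, which is the classic symptom of <code>managed heap corruption</code>. To confirm this, run the <code>!verifyheap</code> command; the output is as follows:</p>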
<pre><code class="language-C#">
0:016> !verifyheap
Could not request method table data for object 00000296DB67CFC0 (MethodTable: 0000046100000858).
Last good object: 00000296DB67CEF0.
</code></pre>
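<h3 id="2-为什么托管堆损坏了">2. Why was the managed heap corrupted</h3>
<p>On the timeline, <code>managed heap corruption</code> is the second scene; the first scene is the moment the damage was actually done. Time cannot be rewound, so a dump cannot show what happened earlier. What then? One option is to examine the <code>damaged area</code> itself, a bit like <code>forensics</code>. Use <code>dp 00000296DB67CFC0-0x80 L20</code> to inspect the memory around the object; the output is as follows:</p>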
<pre><code class="language-C#">
0:016> dp 00000296DB67CFC0-0x80 L20
00000296`db67cf40  41816d40`414f1533 43202a65`41000000
00000296`db67cf50  414f1533`43202a65 41016d40`41016d40
00000296`db67cf60  40cf1533`40cf1533 414f1533`40cf1533
00000296`db67cf70  411fd70a`3f000000 43202a65`411fd70a
00000296`db67cf80  41016d40`43202a65 40800000`411b4fe6
00000296`db67cf90  41a00005`4247fffc 3f000000`41a00005
00000296`db67cfa0  4221c88f`41200000 00000000`411b4fe6
00000296`db67cfb0  000000be`00000000 00000523`000003ee
00000296`db67cfc0  00000461`0000085b 00000004`000007a6
00000296`db67cfd0  000003ee`000000be 0000085b`00000523
00000296`db67cfe0  000007a6`00000461 000000be`00000004
00000296`db67cff0  00000523`000003ee 00000461`0000085b
00000296`db67d000  00000004`000007a6 00000000`00000000
00000296`db67d010  00000000`00000000 00000000`00000000
00000296`db67d020  00000000`00000000 00000000`00000000
00000296`db67d030  00000000`00000000 00000000`00000000
</code></pre>
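<p>Looking back at the error message, <code>00000296DB67CFC0</code> should hold a <code>method table</code> pointer, but it now holds a run of small numbers that looks like an array written by C++. To avoid a misjudgment, I asked my friend to capture another crash dump to see whether the corruption was random, so that we would not head off in the wrong direction. He captured a second dump without trouble.</p>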
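<p>To make the "looks like an array" reading easier to check, here is a minimal sketch that splits the 64-bit values printed by <code>dp</code> into the 32-bit integers they hold in memory and looks for a repeating record. The helper names (<code>split_qword</code>, <code>qwords_to_ints</code>, <code>smallest_period</code>) are hypothetical and not part of any tool; the data is copied from the first dump above, starting at <code>00000296`db67cfb8</code>, and treating it as signed 32-bit integers is an assumption.</p>
<pre><code class="language-C">
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split one 64-bit value from `dp` output into the two 32-bit integers
   it holds in memory (little-endian: the low half sits at the lower address). */
static void split_qword(uint64_t q, int32_t *first, int32_t *second)
{
    *first  = (int32_t)(uint32_t)(q & 0xFFFFFFFFu);
    *second = (int32_t)(uint32_t)(q >> 32);
}

/* Expand n qwords into 2*n int32 values, in memory order. */
static void qwords_to_ints(const uint64_t *q, size_t n, int32_t *out)
{
    for (size_t i = 0; i < n; i++)
        split_qword(q[i], &out[2 * i], &out[2 * i + 1]);
}

/* Smallest p such that v[i] == v[i + p] for every valid i; n if none. */
static size_t smallest_period(const int32_t *v, size_t n)
{
    for (size_t p = 1; p < n; p++) {
        size_t i = 0;
        while (i + p < n && v[i] == v[i + p])
            i++;
        if (i + p == n)
            return p;
    }
    return n;
}

/* Ten qwords from the first dump, 00000296`db67cfb8 .. 00000296`db67d007. */
static const uint64_t first_dump[10] = {
    0x00000523000003eeULL, 0x000004610000085bULL, 0x00000004000007a6ULL,
    0x000003ee000000beULL, 0x0000085b00000523ULL, 0x000007a600000461ULL,
    0x000000be00000004ULL, 0x00000523000003eeULL, 0x000004610000085bULL,
    0x00000004000007a6ULL,
};
</code></pre>
<p>Decoded this way, the region is a sequence of small integers (190, 1006, 1315, 2139, 1121, 1958, 4) repeating every seven values, and the "method table" <code>00000461`0000085b</code> is simply the pair 2139, 1121 from that sequence. That is consistent with a native buffer of <code>int</code> records having been written over the object header.</p>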
<pre><code class="language-C#">
For analysis of this file, run !analyze -v
clr!WKS::gc_heap::find_first_object+0x83:
00007ff9`a077a388 833900 cmp dword ptr [rcx],0 ds:000001cc`00000666=????????
0:094> !verifyheap
Could not request method table data for object 000001C64B541738 (MethodTable: 000001CC00000664).
Last good object: 000001C64B53F758.
0:094> dp 000001C64B541738-0x80 L20
000001c6`4b5416b8  00000000`00000000 00000000`00000000
000001c6`4b5416c8  00000000`00000000 00000000`00000000
000001c6`4b5416d8  00000000`00000000 00000000`00000000
000001c6`4b5416e8  00000000`00000000 00000000`00000000
000001c6`4b5416f8  00000000`00000000 00000000`00000000
000001c6`4b541708  00000000`00000000 00000000`00000000
000001c6`4b541718  00000000`00001fe0 00000000`00000000
000001c6`4b541728  000000be`00000000 00000261`00000556
000001c6`4b541738  000001cc`00000666 00000004`000005c7
000001c6`4b541748  00000556`000000be 00000666`00000261
000001c6`4b541758  000005c7`000001cc 000000be`00000004
000001c6`4b541768  00000261`00000556 000001cc`00000666
000001c6`4b541778  00000004`000005c7 00000000`00000000
000001c6`4b541788  00000000`00000000 00000000`00000000
000001c6`4b541798  00000000`00000000 00000000`00000000
000001c6`4b5417a8  00000000`00000000 00000000`00000000
</code></pre>
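<p>The second dump also shows content resembling a C++ array. At this point it is fairly safe to conclude that someone is, intentionally or not, writing array data into the managed heap and corrupting managed objects. I asked my friend to focus on fixed, P/Invoke, and similar constructs in the code; screenshot below:</p>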
<p><img src="https://img2024.cnblogs.com/blog/214741/202508/214741-20250820094148060-2097315959.png" alt="" loading="lazy"></p>
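<p>One more cross-check ties the two dumps together. In both, the word sitting at the object address matches the faulting address once its lowest bit is cleared, and matches the MethodTable value printed by <code>!verifyheap</code> once its lowest two bits are cleared. The sketch below only replays that arithmetic; the 1-bit and 2-bit masks are inferred from these two dumps and are an assumption, not a statement about CLR or SOS internals, and the function names are hypothetical.</p>
<pre><code class="language-C">
#include <assert.h>
#include <stdint.h>

/* Clear the lowest `bits` bits of a header word. */
static uint64_t clear_low_bits(uint64_t word, unsigned bits)
{
    return word & ~((UINT64_C(1) << bits) - 1);
}

/* Nonzero when the header word found in the dump reproduces both the
   faulting address shown by WinDbg and the MethodTable value printed by
   !verifyheap. The masks are an assumption inferred from the two dumps. */
static int header_explains_crash(uint64_t header,
                                 uint64_t fault_addr,
                                 uint64_t reported_mt)
{
    return clear_low_bits(header, 1) == fault_addr
        && clear_low_bits(header, 2) == reported_mt;
}
</code></pre>
<p>For the first dump, <code>header_explains_crash(0x000004610000085b, 0x000004610000085a, 0x0000046100000858)</code> holds; for the second, <code>header_explains_crash(0x000001cc00000666, 0x000001cc00000666, 0x000001cc00000664)</code> holds as well. In other words, the GC was not following a stale pointer: it was dereferencing two overwritten integers as if they were a method table address.</p>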
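<h3 id="3-后续花絮">3. Follow-up</h3>
<p>A few days later my friend came back with good news: by adding <code>assert</code> checks and narrowing things down step by step, he did find it in the end. In short, C++ was writing out of bounds into the managed heap. The reference code is as follows:</p>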
<pre><code class="language-C">
int cadx = measure_info.posx;
int cady = measure_info.posy;
int samplex = cadx - nSamplePosInCADx;
int sampley = cady - nSamplePosInCADy;
if (samplex < 0 || samplex >= nWidthSample || sampley < 0 || sampley >= nHeightSample)
{
    continue;
}
int offpos = sampley / grid_ver * 5 + samplex / grid_her; // the buggy version was: int offpos = cady / grid_ver * 5 + cadx / grid_her
assert(offpos >= 0 && offpos < 25);
int cad_offx = pnSubOffset, cad_offy = pnSubOffset;
</code></pre>
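<p>The fragment above depends on surrounding code that is not shown, so here is a self-contained sketch of the same index computation with the range check folded in. The function name <code>grid_cell_index</code>, the constants, and the parameter list are hypothetical; the 5-column, 25-cell layout is taken from the <code>* 5</code> and <code>offpos &lt; 25</code> in the fragment.</p>
<pre><code class="language-C">
#include <assert.h>

#define GRID_COLS  5   /* from the "* 5" in the original fragment */
#define GRID_CELLS 25  /* from the "offpos < 25" assert           */

/* Map a sample-relative position to a cell index in the 5x5 grid.
   Returns -1 instead of an index when the position, the grid step,
   or the resulting index is out of range. */
static int grid_cell_index(int samplex, int sampley,
                           int width, int height,
                           int grid_her, int grid_ver)
{
    if (grid_her <= 0 || grid_ver <= 0)
        return -1;
    if (samplex < 0 || samplex >= width || sampley < 0 || sampley >= height)
        return -1;

    int offpos = sampley / grid_ver * GRID_COLS + samplex / grid_her;
    if (offpos < 0 || offpos >= GRID_CELLS)
        return -1;
    return offpos;
}
</code></pre>
<p>As an illustration with made-up numbers: for a 500x500 sample with a grid step of 100, the sample-relative point (250, 450) maps to cell 22. If the absolute CAD coordinates (1250, 2450) are fed into the same formula, as in the buggy version, the result is 2450 / 100 * 5 + 1250 / 100 = 132, far outside a 25-element array. Returning -1 rather than relying only on <code>assert</code> keeps the check active in release builds, where <code>assert</code> is typically compiled out.</p>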
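<p>Finding it was of course a relief, and this kind of problem really is hard to crack. Still, I am not sure why my friend did not use the TTD (Time Travel Debugging) approach I recommended. His program has one important trait: it reliably crashes within one minute of startup, which makes it a good candidate for TTD. The process uptime below shows this:</p>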
<pre><code class="language-C#">
0:094> vertarget
Windows 10 Version 19044 MP (32 procs) Free x64
Product: WinNt, suite: SingleUserTS
Edition build lab: 19041.1.amd64fre.vb_release.191206-1406
Debug session time: Wed Mar 12 15:12:41.000 2025 (UTC + 8:00)
System Uptime: 32 days 4:35:52.688
Process Uptime: 0 days 0:00:41.000
Kernel time: 0 days 0:01:21.000
User time: 0 days 0:08:01.000
</code></pre>
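<h2 id="三总结">III. Summary</h2>
<p>This incident was a crash caused by managed heap corruption: C++ code operating on a C# managed object overran an array. Cases where the second scene alone yields enough clues are truly rare, so we were lucky in that sense. Credit also goes to my friend for not giving up until he saw daylight. Debugging is hard!</p>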
Source: https://www.cnblogs.com/huangxincheng/p/19047949