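Crash Analysis of a .NET Medical Consortium Management System
<h2 id="一背景">I: Background</h2><h3 id="1-讲故事">1. The Story</h3>
<p>I've spent so much time delivering takeout lately that it feels like ages since I last wrote an article. Today I'm back with another crash-type production incident. An old friend reached out on WeChat and asked me to help figure out why his program crashed. With the dump in hand, we can dive right into the analysis.</p>
<h2 id="二崩溃分析">II: Crash Analysis</h2>
<h3 id="1-为什么会崩溃">1. Why did it crash</h3>
<p>Open the dump file by double-clicking it, and you'll see an overview of the crash information, as shown below:</p>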
<pre><code class="language-C#">
Executable search path is:
Windows 10 Version 17763 MP (48 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 17763.1.amd64fre.rs5_release.180914-1434
Debug session time: Fri Oct 31 17:38:42.000 2025 (UTC + 8:00)
System Uptime: 14 days 2:42:29.643
Process Uptime: 0 days 0:00:58.000
................................................................
.......................................
Loading unloaded module list
.
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(5a74.6250): Unknown exception - code c0000374 (first/second chance not available)
For analysis of this file, run !analyze -v
ntdll!NtWaitForMultipleObjects+0x14:
00007ffe`57baf0e4 c3 ret
</code></pre>
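<p>From the output, the crash code is <code>c0000374</code>, i.e. NT heap corruption (<code>STATUS_HEAP_CORRUPTION</code>). Ha, that narrows the scope right away.</p>
<h3 id="2-为什么ntheap-损坏">2. Why is the NT heap corrupted</h3>
<p>So why is the NT heap corrupted? Use <code>.ecxr</code> to switch to the context at the time of the crash, then inspect the call stack to see what the thread was doing.</p>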
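<p>When no dump is at hand, the same code often shows up as the process exit code, printed as a negative decimal number. The snippet below is a minimal sketch of mapping such an exit code back to the NTSTATUS value; the class and method names (<code>NtStatusCodes</code>, <code>Describe</code>) are hypothetical and not taken from the project being analyzed.</p>

```csharp
using System;

// Hypothetical helper: names are illustrative, not from the analyzed project.
public static class NtStatusCodes
{
    // STATUS_HEAP_CORRUPTION as defined in ntstatus.h.
    public const uint StatusHeapCorruption = 0xC0000374;

    // Reinterprets a signed process exit code as an unsigned NTSTATUS value
    // and labels the one code this article is about.
    public static string Describe(int exitCode)
    {
        uint status = unchecked((uint)exitCode);
        return status == StatusHeapCorruption
            ? "STATUS_HEAP_CORRUPTION (0xC0000374)"
            : $"0x{status:X8}";
    }
}
```

<p>For example, <code>NtStatusCodes.Describe(-1073740940)</code> identifies the heap corruption code, since -1073740940 is 0xC0000374 read as a signed 32-bit integer.</p>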
<pre><code class="language-C#">
0:032> .ecxr
0:032> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 000000b4`8503ede0 00007ffe`57c0b313 ntdll!RtlReportFatalFailure+0x9
01 000000b4`8503ee30 00007ffe`57c13b9e ntdll!RtlReportCriticalFailure+0x97
02 000000b4`8503ef20 00007ffe`57c13eaa ntdll!RtlpHeapHandleError+0x12
03 000000b4`8503ef50 00007ffe`57bae109 ntdll!RtlpHpHeapHandleError+0x7a
04 000000b4`8503ef80 00007ffe`57bbbb0e ntdll!RtlpLogHeapFailure+0x45
05 000000b4`8503efb0 00007ffe`17d17b3f ntdll!RtlFreeHeap+0x9d3ce
06 000000b4`8503f050 00007ffe`541392af AcLayers!NS_FaultTolerantHeap::APIHook_RtlFreeHeap+0x41f
07 000000b4`8503f0b0 00007ffe`3773b17e KERNELBASE!LocalFree+0x2f
08 000000b4`8503f0f0 00007ffe`37661d12 mscorlib_ni+0x58b17e
09 000000b4`8503f1a0 00007ffd`e49fe127 mscorlib_ni!System.Runtime.InteropServices.Marshal.FreeHGlobal+0x22
...
0:032> !clrstack
OS Thread Id: 0x6250 (32)
Child SP IP Call Site
000000b48503f118 00007ffe57baf0e4 Microsoft.Win32.Win32Native.LocalFree(IntPtr)
000000b48503f118 00007ffe3773b17e Microsoft.Win32.Win32Native.LocalFree(IntPtr)
000000b48503f0f0 00007ffe3773b17e DomainNeutralILStubClass.IL_STUB_PInvoke(IntPtr)
000000b48503f1a0 00007ffe37661d12 System.Runtime.InteropServices.Marshal.FreeHGlobal(IntPtr)
000000b48503f1e0 00007ffde49fe127 b.B+A.MoveNext()
000000b48503f240 00007ffe376b3423 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000b48503f310 00007ffe376b32b4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
...
000000b48503f5c0 00007ffde49fb04e DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int32, Int32, Int64)
</code></pre>
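<p>From the output it is clear that the <code>b.B+A.MoveNext</code> method called <code>FreeHGlobal</code>, which triggered the NT heap crash. If you have enough experience, seeing <code>FreeHGlobal</code> here should immediately bring <code>double free</code> to mind; it is a classic problem.</p>
<h3 id="3-何为-double-free">3. What is a double free</h3>
<p>A double free means releasing the same block twice. Windows' <code>RtlFreeHeap</code> checks for this in its own logic and treats it as a fatal error. Next, you may want to know the address of this block. You can find it with <code>!heap -s</code>; the output is as follows:</p>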
<pre><code class="language-C#">
0:032> !heap -s
************************************************************************************************************************
NT HEAP STATS BELOW
************************************************************************************************************************
Details:
Heap address:0000028c75bb0000
Error address: 0000028c786018a0
Error type: HEAP_FAILURE_BLOCK_NOT_BUSY
Details: The caller performed an operation (such as a free
or a size check) that is illegal on a free block.
Follow-up:Check the error's stack trace to find the culprit.
Stack trace:
Stack trace at 0x00007ffe57c72848
00007ffe57bae109: ntdll!RtlpLogHeapFailure+0x45
00007ffe57bbbb0e: ntdll!RtlFreeHeap+0x9d3ce
00007ffe17d17b3f: AcLayers!NS_FaultTolerantHeap::APIHook_RtlFreeHeap+0x41f
00007ffe541392af: KERNELBASE!LocalFree+0x2f
00007ffe3773b17e: mscorlib_ni+0x58b17e
00007ffe37661d12: mscorlib_ni!System.Runtime.InteropServices.Marshal.FreeHGlobal+0x22
00007ffde49fe127: +0xe49fe127
LFH Key : 0x765363a7204cf973
Termination on corruption : ENABLED
            Heap    Flags  Reserv  Commit    Virt    Free   List   UCR   Virt  Lock  Fast
                              (k)     (k)     (k)     (k) length       blocks cont.  heap
-------------------------------------------------------------------------------------
0000028c75bb0000 00000002   17920    9256   16364    2120    214     5      1     a  LFH
    External fragmentation  23 % (214 free blocks)
0000028c75b40000 00008000      64       4      64       2      1     1      0     0
0000028c75de0000 00001002    2636     132    1080      20      5     2      0     0  LFH
0000028c76190000 00001002    4680    2268    3124    1420     40     3      0     0  LFH
    External fragmentation  62 % (40 free blocks)
0000028c76130000 00001002    2636     472    1080       5     27     2      0     0  LFH
0000028c767f0000 00041002      60       8      60       5      1     1      0     0
0000028c77020000 00041002      60      16      60       2      2     1      0     0
-------------------------------------------------------------------------------------
</code></pre>
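<p>From the output, <code>Error address: 0000028c786018a0</code> is the address of the block (<code>Heap address:0000028c75bb0000</code> is the heap that owns it), and the error type <code>HEAP_FAILURE_BLOCK_NOT_BUSY</code> says that a free was attempted on a block that is not busy. Next, run <code>!heap -x 0000028c786018a0</code> to check the state of this block; it is indeed already free.</p>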
<pre><code class="language-C#">
0:032> !heap -x 0000028c786018a0
Entry             User              Heap              Segment               Size  PrevSize  Unused    Flags
-------------------------------------------------------------------------------------------------------------
0000028c786018a0  0000028c786018b0  0000028c75bb0000  0000028c785c80d0        e0      -            0  LFH;free
</code></pre>
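<p>To make the <code>HEAP_FAILURE_BLOCK_NOT_BUSY</code> check concrete, here is a toy model of a heap that tracks whether each block is busy and rejects a second free. It is only an illustration of the idea: the real NT heap is far more involved, and <code>ToyHeap</code> and its members are hypothetical names, not Windows or .NET APIs.</p>

```csharp
using System;
using System.Collections.Generic;

// Toy model only: the real NT heap does not work like this internally.
public sealed class ToyHeap
{
    // Maps a block address to its state: true = busy, false = free.
    private readonly Dictionary<long, bool> _busy = new Dictionary<long, bool>();
    private long _next = 0x1000;

    // Hands out a new fake address and marks the block busy.
    public long Alloc()
    {
        long address = _next;
        _next += 0x10;
        _busy[address] = true;
        return address;
    }

    // Frees a block; freeing a block that is already free is reported
    // as an error, mirroring the failure type seen in the dump.
    public void Free(long address)
    {
        if (!_busy.TryGetValue(address, out bool busy))
            throw new InvalidOperationException("unknown block");
        if (!busy)
            throw new InvalidOperationException("HEAP_FAILURE_BLOCK_NOT_BUSY");
        _busy[address] = false;
    }
}
```

<p>Calling <code>Free</code> twice on the same address throws on the second call. In the real process, with termination on corruption enabled, the equivalent check ends the process instead of throwing a catchable exception.</p>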
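<p>At this point the cause of the problem is fully understood. What remains is to trace it back to the offending code.</p>
<h3 id="4-问题代码在哪里">4. Where is the offending code</h3>
<p>As readers will have noticed, the problem lies in the <code>b.B+A.MoveNext()</code> method. Judging by the names, this project has been obfuscated, which is a bit of a nuisance... It takes some squinting to read. Screenshot below:</p>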
<p><img src="https://img2024.cnblogs.com/blog/214741/202511/214741-20251112175330201-4174274.png" alt="" loading="lazy"></p>
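<p>Judging from <code>IntPtr intPtr = Interlocked.Exchange(ref b.A, IntPtr.Zero);</code> in the screenshot, <code>b.A</code> is a class-level field. It looks like several methods manipulate this field without properly coordinating with each other. To get to the bottom of it, I went through the source code again, and that was indeed the case. Screenshot below:</p>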
<p><img src="https://img2024.cnblogs.com/blog/214741/202511/214741-20251112175330087-75303353.png" alt="" loading="lazy"></p>
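<p>With that, the truth is out. The fix is for my friend to modify the source code so that access to this variable is properly controlled.</p>
<h2 id="三总结">III: Summary</h2>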
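<p>One common way to control such a field is to give the pointer a single owner and route every release through the same atomic claim, so that at most one caller ever reaches <code>FreeHGlobal</code>. The sketch below assumes that approach; <code>NativeBuffer</code> and its members are hypothetical names, and the real code in the screenshots may need a different fix.</p>

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical wrapper: a sketch of one possible fix, not the project's code.
public sealed class NativeBuffer : IDisposable
{
    // The only copy of the unmanaged pointer; IntPtr.Zero means "already released".
    private IntPtr _ptr;

    public NativeBuffer(int size)
    {
        _ptr = Marshal.AllocHGlobal(size);
    }

    public bool IsReleased => Volatile.Read(ref _ptr) == IntPtr.Zero;

    // Atomically takes the pointer out of the field. Only the caller that
    // receives a non-zero value frees it; every other caller sees Zero.
    public bool Release()
    {
        IntPtr claimed = Interlocked.Exchange(ref _ptr, IntPtr.Zero);
        if (claimed == IntPtr.Zero)
            return false;
        Marshal.FreeHGlobal(claimed);
        return true;
    }

    public void Dispose() => Release();
}
```

<p>The important part is that no method frees a copy of the pointer it read earlier; every path has to go through <code>Release</code>. A single code path that calls <code>Marshal.FreeHGlobal</code> on the field directly is enough to bring the double free back.</p>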
<p>This production incident was a fairly classic double free problem. If you have never run into one, you may take a few detours along the way; for old hands like us, spotting one or two telltale signs is enough to know the problem is as good as solved.</p><br><br>
Source: https://www.cnblogs.com/huangxincheng/p/19214907