The .NET 5.0 Scapegoat Case, Episode 6 - Revisiting the Crime Scene: How Kubernetes Deployments Behave During the Failure

世界人民团结起来 posted on 2020-11-18 19:14:00

<ul>
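<li>Episode 1: Verifying the problem with the official .NET 5.0 release docker image</li>
<li>Episode 2: A small hole in the code, a big pit behind it: the key suspect EnyimMemcachedCore is found</li>
<li>Episode 3 - Plot twist: EnyimMemcachedCore is innocent, .NET 5.0 keeps taking the blame</li>
<li>Episode 4: One .NET, two contingency plans; one issue, double the attention</li>
<li>Episode 5 - Breakthrough in the case: it was all our fault, and we let .NET 5.0 take the blame</li>
<li>Episode 6 - Revisiting the crime scene: how Kubernetes deployments behave during the failure</li>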
</ul>
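Our blog system runs on a Kubernetes cluster that we built ourselves on Alibaba Cloud servers. The failure already shows up while k8s is updating pods during a deployment, so during yesterday's release we made a point of watching the process. This episode shares what we saw.
During a deployment, k8s updates the pods in 3 phases: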
<ol>
<li>"xxx new replicas have been updated"</li>
<li>"xxx replicas are pending termination"</li>
<li>"xxx updated replicas are available"</li>
</ol>
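In a normal release, the whole deployment usually completes in about 5-8 minutes (this depends on the livenessProbe and readinessProbe configuration). Below is the console output during a deployment: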
<pre><code class="language-text">Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
Waiting for deployment spec update to be observed...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
...
Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
...
Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
deployment "blog-web" successfully rolled out
</code></pre>
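The three phases can be told apart mechanically from the progress messages. The following is a minimal sketch, not something we ran during the release: the names `PHASES` and `classify` are hypothetical, and the patterns are written only against the message wording shown in the output above.

```python
import re

# Patterns for the three progress messages seen in the console output above.
# The phase labels on the left are illustrative names, not Kubernetes terms.
PHASES = [
    ("updating", re.compile(r"(\d+) out of (\d+) new replicas have been updated")),
    ("terminating", re.compile(r"(\d+) old replicas are pending termination")),
    ("waiting_available", re.compile(r"(\d+) of (\d+) updated replicas are available")),
]


def classify(line):
    """Return (phase, numbers) for a rollout progress line, or None if it is not one."""
    for phase, pattern in PHASES:
        match = pattern.search(line)
        if match:
            return phase, tuple(int(group) for group in match.groups())
    return None
```

Feeding each console line through `classify` and recording a timestamp whenever the phase changes would show where the extra minutes go in a slow deployment.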
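In the failure scenario, the whole deployment takes about 15 minutes to complete. All 3 phases of the pod update are slower than normal, especially the "old replicas are pending termination" phase.
During the deployment we checked the pod status with <code>kubectl get pods -l app=blog-web -o wide</code>. The newly deployed pods in the Running state have started their containers and, at that moment, have not been killed by the livenessProbe, yet most of them show 0/1 in the READY column, which means their readinessProbe health check is failing. A RESTARTS count greater than 0 means the livenessProbe health check failed at some point and the container was restarted.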
<pre><code class="language-text">NAME                        READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
blog-web-55d5677cf-2854n    0/1     Running   1          5m1s    192.168.107.213   k8s-node3    <none>           <none>
blog-web-55d5677cf-7vkqb    0/1     Running   2          6m17s   192.168.228.33    k8s-n9       <none>           <none>
blog-web-55d5677cf-8gq6n    0/1     Running   2          5m29s   192.168.102.235   k8s-n19      <none>           <none>
blog-web-55d5677cf-g8dsr    0/1     Running   2          5m54s   192.168.104.78    k8s-node11   <none>           <none>
blog-web-55d5677cf-kk9mf    0/1     Running   2          6m9s    192.168.42.3      k8s-n13      <none>           <none>
blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-lmbvf    0/1     Running   2          5m54s   192.168.201.123   k8s-n14      <none>           <none>
blog-web-55d5677cf-ms2tk    0/1     Pending   0          6m9s    <none>            <none>       <none>           <none>
blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s   192.168.254.129   k8s-n7       <none>           <none>
blog-web-55d5677cf-nnjdx    0/1     Pending   0          4m48s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-pqgpr    0/1     Pending   0          4m33s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-qrjr5    0/1     Pending   0          2m38s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-t5wvq    1/1     Running   3          6m17s   192.168.10.100    k8s-n12      <none>           <none>
blog-web-55d5677cf-w52xc    1/1     Running   3          6m17s   192.168.73.35     k8s-node10   <none>           <none>
blog-web-55d5677cf-zk559    0/1     Running   1          5m21s   192.168.118.6     k8s-n4       <none>           <none>
blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m     192.168.168.77    k8s-n6       <none>           <none>
blog-web-5b57b7fcb6-cgfr4   1/1     Running   4          19m     192.168.89.250    k8s-n8       <none>           <none>
blog-web-5b57b7fcb6-cz278   1/1     Running   3          19m     192.168.218.99    k8s-n18      <none>           <none>
blog-web-5b57b7fcb6-hvzwp   1/1     Running   3          18m     192.168.195.242   k8s-node5    <none>           <none>
blog-web-5b57b7fcb6-rhgkq   1/1     Running   1          16m     192.168.86.126    k8s-n20      <none>           <none>
</code></pre>
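To read a listing like this at a glance, the rows can be tallied per ReplicaSet. This is a rough sketch under stated assumptions: `summarize` and `SAMPLE` are hypothetical names, the input is the plain-text output of the command above, and the RESTARTS column is assumed to be a bare integer as it is in this listing. `SAMPLE` holds just four of the rows for illustration.

```python
from collections import Counter

# Four rows copied from the listing above, used only as sample input.
SAMPLE = """
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
blog-web-55d5677cf-2854n 0/1 Running 1 5m1s 192.168.107.213 k8s-node3 <none> <none>
blog-web-55d5677cf-kqwzc 0/1 Pending 0 4m44s <none> <none> <none> <none>
blog-web-55d5677cf-nkjrd 1/1 Running 2 6m17s 192.168.254.129 k8s-n7 <none> <none>
blog-web-5b57b7fcb6-7cbdt 1/1 Running 2 18m 192.168.168.77 k8s-n6 <none> <none>
"""


def summarize(pods_output):
    """Count pod states per ReplicaSet from `kubectl get pods -o wide` text output."""
    summary = {}
    for line in pods_output.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, ready, status = fields[0], fields[1], fields[2]
        restarts = int(fields[3])  # assumes a bare integer in the RESTARTS column
        # Pod names look like <deployment>-<replicaset hash>-<pod suffix>.
        replica_set = name.rsplit("-", 1)[0]
        current, desired = ready.split("/")
        counts = summary.setdefault(replica_set, Counter())
        counts["total"] += 1
        if status == "Pending":
            counts["pending"] += 1
        elif current == desired:
            counts["ready"] += 1
        else:
            counts["not_ready"] += 1
        if restarts > 0:
            counts["restarted"] += 1
    return summary
```

Counting the full listing by hand gives the same picture: of the 15 pods in the new ReplicaSet `blog-web-55d5677cf`, 3 are ready, 7 are running but not ready, and 5 are still Pending, while all 5 remaining old pods in `blog-web-5b57b7fcb6` are ready.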
In our k8s deployment configuration, the livenessProbe and the readinessProbe check the same URL. The configuration is as follows:
<pre><code class="language-yaml">livenessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 30
  periodSeconds: 3
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 40
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
</code></pre>
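A back-of-the-envelope reading of these numbers helps explain the restarts. The sketch below is only an approximation: the `Probe` class and `failure_window` are hypothetical names, and the estimate of `periodSeconds * failureThreshold` ignores `timeoutSeconds` and the exact scheduling of probes by the kubelet.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """The probe settings from the configuration above that affect timing."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def failure_window(self):
        """Rough number of seconds of consecutive failures before k8s acts."""
        return self.period_seconds * self.failure_threshold


# Values copied from the deployment configuration above.
liveness = Probe(initial_delay_seconds=30, period_seconds=3, failure_threshold=5)
readiness = Probe(initial_delay_seconds=40, period_seconds=5, failure_threshold=5)
```

Under this estimate, a container that stops answering `/` is restarted after roughly 15 seconds of consecutive livenessProbe failures, while the readinessProbe needs roughly 25 seconds of failures to mark a ready pod as not ready. A restarted container also cannot become ready before its first readinessProbe, about 40 seconds after it starts.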
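Because of a latent concurrency problem, the livenessProbe and readinessProbe health checks fail frequently, so k8s stumbles through the pod update. During this process some old pods are still sharing the load, and when new pods run into trouble the rollout pauses and waits for the pods being deployed to recover. The impact of the failure is therefore contained at this stage, and the site works intermittently.
This stumbling, difficult deployment does eventually finish, and the moment it finishes is the moment the failure breaks out in full. Once the deployment completes, the new pods take over all of the load. The new pods, which carry the concurrency problem, collapse under the pressure of concurrent requests: several pods are restarted because the livenessProbe health check fails, and after restarting they struggle to reach the ready state and share the load because the readinessProbe health check fails. The few remaining pods are overwhelmed, CrashLoopBackOff shows up on one pod after another, and under the unending stream of concurrent requests there are never enough pods to handle the current load, so the failure never recovers.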
Source: https://www.cnblogs.com/cmt/p/13999061.html
