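Cnblogs Goes Overseas - Assembling the Containers: Building a Kubernetes Cluster
<p><img src="https://img2024.cnblogs.com/blog/35695/202509/35695-20250912095708546-103627121.jpg" alt="Benefits-of-Using-Kubernetes-810x506" width="600" loading="lazy"></p><p>In the opening post we announced that the Cnblogs overseas voyage had set sail, with Alibaba Cloud chosen as the ship.</p>
<p>The first piece of preparation is assembling the containers on board: building a Kubernetes cluster.</p>
<p>Our overseas base is Alibaba Cloud's Singapore data center. We built the Kubernetes cluster ourselves on Alibaba Cloud ECS rather than using Alibaba Cloud Container Service (ACK).</p>
<p>First we bought one ECS instance for the Control Plane node. The Control Plane is the command-and-coordination center and does none of the actual work, so it does not need a high spec. We picked a 2-core, 4 GB economy ECS instance (ecs.e-c1m2.large) running Ubuntu 24.04, added it to a newly created kube security group (every node server in the cluster joins this security group), and set its hostname to kube-cp-01.</p>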
<h3 id="准备工作">Preparation</h3>
<h4 id="安装-k8s-三驾马车">Installing the k8s trio</h4>
<p>Install kubelet + kubeadm + kubectl; the version used is 1.33.4.</p>
<p>Install the required packages</p>
<pre><code class="language-shell">apt-get update
apt-get install -y apt-transport-https ca-certificates curl gnupg
</code></pre>
<p>Add the signing key for the k8s package repository</p>
<pre><code class="language-shell">curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
chmod 644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
</code></pre>
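<p>Add the k8s package repository (the signed-by option points apt at the key saved in the previous step)</p>
<pre><code class="language-shell">echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list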
chmod 644 /etc/apt/sources.list.d/kubernetes.list
</code></pre>
<p>Install the trio with apt-get</p>
<pre><code class="language-shell">apt-get update
apt-get install -y kubelet kubectl kubeadm
</code></pre>
<p>Confirm the versions</p>
<pre><code class="language-shell">~# kubelet --version
Kubernetes v1.33.4
~# kubectl version
Client Version: v1.33.4
~# kubeadm version -o short
v1.33.4
</code></pre>
<h4 id="配置网络">Configuring the network</h4>
<p>Enable IPv4 packet forwarding</p>
<pre><code class="language-shell">echo "net.ipv4.ip_forward = 1" | tee /etc/sysctl.d/k8s.conf
sysctl --system
</code></pre>
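<p>To confirm that the setting actually took effect, a small check along the following lines can help. This is a minimal sketch: the helper name ip_forward_enabled is hypothetical, and it simply reads the kernel's current value from /proc (the path argument exists only so the function can be tried against another file).</p>

```shell
# ip_forward_enabled [PATH]: succeed when the kernel reports IPv4 forwarding as on.
# PATH defaults to the standard procfs location; the override is only for testing.
ip_forward_enabled() {
  local path="${1:-/proc/sys/net/ipv4/ip_forward}"
  [ "$(cat "$path")" = "1" ]
}

# Usage on the node:
#   ip_forward_enabled && echo "forwarding is on" || echo "forwarding is off"
```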
<h4 id="安装容器运行时-containerd">Installing the container runtime containerd</h4>
<p>We installed containerd manually; the version is 2.1.4.</p>
<p>Download and extract to /usr/local</p>
<pre><code class="language-shell">wget -c https://github.com/containerd/containerd/releases/download/v2.1.4/containerd-2.1.4-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-2.1.4-linux-amd64.tar.gz
</code></pre>
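<p>Since the tarball is fetched by hand, it may be worth checking its digest before extracting. The sketch below is not part of the original steps: verify_sha256 is a hypothetical helper, and it assumes a matching .sha256sum file was downloaded from the same release page and saved next to the tarball.</p>

```shell
# verify_sha256 FILE SUMFILE: succeed only if FILE's SHA-256 digest equals
# the first field of the first line in SUMFILE.
verify_sha256() {
  local file="$1" sumfile="$2"
  local expected actual
  expected=$(awk '{ print $1; exit }' "$sumfile")
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (file names assume the .sha256sum asset sits beside the tarball):
#   verify_sha256 containerd-2.1.4-linux-amd64.tar.gz \
#     containerd-2.1.4-linux-amd64.tar.gz.sha256sum \
#     && tar Cxzvf /usr/local containerd-2.1.4-linux-amd64.tar.gz
```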
<p>Have systemd start containerd automatically</p>
<pre><code class="language-shell">mkdir -p /usr/local/lib/systemd/system
wget -c https://raw.githubusercontent.com/containerd/containerd/main/containerd.service -O /usr/local/lib/systemd/system/containerd.service
systemctl daemon-reload
systemctl enable --now containerd
</code></pre>
<p>Install runc</p>
<pre><code class="language-shell">wget -c https://github.com/opencontainers/runc/releases/download/v1.3.1/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc
</code></pre>
<p>Install the CNI plugins</p>
<pre><code class="language-shell">wget -c https://github.com/containernetworking/plugins/releases/download/v1.8.0/cni-plugins-linux-amd64-v1.8.0.tgz
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.8.0.tgz
</code></pre>
<p>Generate the containerd configuration</p>
<pre><code class="language-shell">mkdir /etc/containerd
containerd config default > /etc/containerd/config.toml
</code></pre>
<p>Enable SystemdCgroup in /etc/containerd/config.toml</p>
<pre><code class="language-toml">
...
SystemdCgroup = true
</code></pre>
<p>Restart containerd so the configuration takes effect</p>
<pre><code class="language-shell">systemctl restart containerd
</code></pre>
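<p>The SystemdCgroup edit can also be scripted instead of done in an editor. The following is only a sketch under an assumption: it expects the generated config to already contain a line of the form SystemdCgroup = false. If the generated file has no such line, the function fails and the key has to be added by hand under the runc options table. The helper name enable_systemd_cgroup is hypothetical.</p>

```shell
# enable_systemd_cgroup FILE: flip "SystemdCgroup = false" to "true" in FILE
# (indentation is preserved), then succeed only if FILE now enables it.
enable_systemd_cgroup() {
  local file="$1"
  sed -i 's/^\([[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*\)false/\1true/' "$file"
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$file"
}

# Usage on the node:
#   enable_systemd_cgroup /etc/containerd/config.toml && systemctl restart containerd
```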
<p>Install nerdctl, the command-line tool for containerd</p>
<pre><code class="language-shell">wget -c https://github.com/containerd/nerdctl/releases/download/v2.1.4/nerdctl-2.1.4-linux-amd64.tar.gz
tar -zxf nerdctl-2.1.4-linux-amd64.tar.gz
mv nerdctl /usr/bin/nerdctl
</code></pre>
<p>Set nerdctl's default namespace to k8s.io</p>
<pre><code class="language-shell">mkdir /etc/nerdctl
echo 'namespace = "k8s.io"' | tee /etc/nerdctl/nerdctl.toml
</code></pre>
<h3 id="创建高可用集群">Creating a highly available cluster</h3>
<p>Add a hostname entry for the control-plane-endpoint to /etc/hosts</p>
<pre><code>127.0.0.1 kube-api
</code></pre>
<p>Create the cluster with kubeadm</p>
<pre><code class="language-shell">kubeadm init \
  --control-plane-endpoint "kube-api:6443" \
  --upload-certs \
  --pod-network-cidr=10.0.0.0/8 \
  --skip-phases=addon/kube-proxy
</code></pre>
<p>Note: kube-proxy is not installed because cilium will replace it.</p>
<p>The following output means the cluster was created successfully</p>
<pre><code>Your Kubernetes control-plane has initialized successfully!
...
</code></pre>
<p>Note: the output above includes the commands for joining control-plane and worker nodes, which will be needed later.</p>
<p>Use nerdctl ps to check the running containers</p>
<pre><code class="language-text">root@kube-cp-01 ~ # nerdctl ps
CONTAINER ID IMAGE COMMAND CREATED STATUS
f54d0fc6215a registry.k8s.io/kube-proxy:v1.33.4 "/usr/local/bin/kube…" About a minute ago Up
73cbd7ab3d69 registry.k8s.io/kube-scheduler:v1.33.4 "kube-scheduler --au…" About a minute ago Up
5ff05420d284 registry.k8s.io/kube-controller-manager:v1.33.4 "kube-controller-man…" About a minute ago Up
7031f91cfc16 registry.k8s.io/kube-apiserver:v1.33.4 "kube-apiserver --ad…" About a minute ago Up
2c79907098d4 registry.k8s.io/etcd:3.5.21-0 "etcd --advertise-cl…" About a minute ago Up
975b724b2814 registry.k8s.io/pause:3.10 "/pause" About a minute ago Up
</code></pre>
<p>Add the config file that kubectl uses</p>
<pre><code class="language-shell">mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
</code></pre>
<p>Check the node status</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-cp-01 Ready control-plane 2m41s v1.33.4
</code></pre>
<p>Note: here the control-plane node is already in the Ready state; if kube-proxy were in use, it would only become Ready after the CNI network plugin had been installed.</p>
<p>Check the pod status</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-674b8bbfcf-994fx 0/1 Pending 0 2m40s
coredns-674b8bbfcf-bsgdd 0/1 Pending 0 2m40s
etcd-kube-cp-01 1/1 Running 0 2m46s
kube-apiserver-kube-cp-01 1/1 Running 0 2m45s
kube-controller-manager-kube-cp-01 1/1 Running 0 2m44s
kube-proxy-vlvt9 1/1 Running 0 2m40s
kube-scheduler-kube-cp-01 1/1 Running 0 2m44s
</code></pre>
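<p>coredns is Pending because the CNI network plugin has not been installed yet.</p>
<h3 id="安装-cni-网络插件">Installing the CNI network plugin</h3>
<p>We chose cilium as the CNI (Container Network Interface) plugin.</p>
<p>Install the cilium CLI</p>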
<pre><code class="language-shell">wget -c https://github.com/cilium/cilium-cli/releases/download/v0.18.7/cilium-linux-amd64.tar.gz
tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
</code></pre>
<p>Install cilium</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # cilium install --version 1.18.1 \
--namespace kube-system \
--set bpf.masquerade=true \
--set kubeProxyReplacement=true
ℹ️Using Cilium version 1.18.1
🔮 Auto-detected cluster name: kubernetes
🔮 Auto-detected kube-proxy has not been installed
ℹ️Cilium will fully replace all functionalities of kube-proxy
</code></pre>
<p>Check cilium's status</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # cilium status --wait
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: OK
\__/¯¯\__/ Hubble Relay: disabled
\__/ ClusterMesh: disabled
DaemonSet cilium Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium-envoy Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium Running: 1
cilium-envoy Running: 1
cilium-operator Running: 1
clustermesh-apiserver
hubble-relay
Cluster Pods: 2/2 managed by Cilium
Helm chart version: 1.18.1
</code></pre>
<p>Confirm that cilium has replaced kube-proxy</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
KubeProxyReplacement: True
</code></pre>
<p>Confirm that eBPF Host-Routing is enabled</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep BPF
Routing: Network: Tunnel Host: BPF
Masquerading: BPF 10.0.0.0/24
</code></pre>
<p>Once cilium is deployed successfully, the coredns pods start running normally as well</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-674b8bbfcf-994fx 1/1 Running 0 8m46s
coredns-674b8bbfcf-bsgdd 1/1 Running 0 8m46s
</code></pre>
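<h3 id="添加更多-control-plane-节点">Adding more Control Plane nodes</h3>
<p>We deploy 3 control plane nodes in total. Prepare 2 more Alibaba Cloud ECS instances with the same 2-core, 4 GB spec as kube-cp-01, add them to the kube security group, name them kube-cp-02 and kube-cp-03, and create their systems from an image of kube-cp-01.</p>
<p>Log in to each of the 2 servers, change the hostname, and reset the existing k8s configuration</p>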
<pre><code class="language-shell">hostnamectl set-hostname kube-cp-02
kubeadm reset
</code></pre>
<p>Add a kube-api entry to /etc/hosts that resolves to the IP address of kube-cp-01</p>
<pre><code class="language-text">172.21.49.56 kube-api
</code></pre>
<p>Use the kubeadm join command below to add the server to the cluster as a control plane node</p>
<pre><code class="language-shell">kubeadm join kube-api:6443 --token xxxxxx \
--discovery-token-ca-cert-hash sha256:yyyyyy \
--control-plane --certificate-key zzzzzz \
-v=6
</code></pre>
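<p>Note: if you no longer have the token + hash + key for the join command that kubeadm init printed when the cluster was created, you can regenerate them on kube-cp-01 with the commands below</p>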
<pre><code class="language-shell">kubeadm init phase upload-certs --upload-certs
kubeadm token create --print-join-command
</code></pre>
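<p>The two commands print their pieces separately, so the control plane join command still has to be put together. A minimal sketch of that assembly follows; control_plane_join is a hypothetical helper, and the usage assumes the certificate key is the last line that upload-certs prints.</p>

```shell
# control_plane_join WORKER_JOIN_CMD CERT_KEY: extend a worker join command
# into a control-plane join command by appending the two extra flags.
control_plane_join() {
  local join_cmd="$1" cert_key="$2"
  printf '%s --control-plane --certificate-key %s\n' "$join_cmd" "$cert_key"
}

# Usage on kube-cp-01 (assumption: the key is the last line of the output):
#   key=$(kubeadm init phase upload-certs --upload-certs | tail -n 1)
#   control_plane_join "$(kubeadm token create --print-join-command)" "$key"
```

<p>The printed command can then be run on the new node, adding -v=6 if verbose output is wanted.</p>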
<p>After the node has joined, change the kube-api entry in /etc/hosts to resolve to 127.0.0.1</p>
<pre><code class="language-text">127.0.0.1 kube-api
</code></pre>
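<p>Because the kube-api entry is first pointed at kube-cp-01 and later switched to 127.0.0.1, scripting the change may be less error-prone than editing by hand. The sketch below uses a hypothetical helper, set_host_entry, that replaces any existing line for the name, so running it repeatedly leaves exactly one entry.</p>

```shell
# set_host_entry IP NAME [FILE]: make NAME resolve to IP in a hosts-format
# file. Existing lines that list NAME are dropped first; comments are kept.
set_host_entry() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}"
  local tmp
  tmp=$(mktemp)
  awk -v n="$name" '
    /^[ \t]*#/ { print; next }
    { for (i = 2; i <= NF; i++) if ($i == n) next; print }
  ' "$file" > "$tmp"
  printf '%s\t%s\n' "$ip" "$name" >> "$tmp"
  # Write back through the original file so its permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage on kube-cp-02, before and after the join:
#   set_host_entry 172.21.49.56 kube-api
#   set_host_entry 127.0.0.1 kube-api
```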
<p>kubectl now shows all 3 control plane nodes</p>
<pre><code class="language-shell">root@kube-cp-03 ~ # kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-cp-01 Ready control-plane 2d6h v1.33.4
kube-cp-02 Ready control-plane 48m v1.33.4
kube-cp-03 Ready control-plane 6m18s v1.33.4
</code></pre>
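<p>And with that, the command-and-coordination trio is assembled.</p>
<p>Next we add the worker nodes that do the real work. Before adding them we need to deploy a load balancer: worker nodes reach the control plane's API server through the load balancer and work according to the control plane's instructions and goals.</p>
<h3 id="部署负载均衡">Deploying the load balancer</h3>
<p>We chose Alibaba Cloud Network Load Balancer (NLB) and created a private-network NLB named kube-api</p>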
<p><img src="https://img2024.cnblogs.com/blog/35695/202508/35695-20250820171610551-2047695641.png" alt="" width="600" loading="lazy"></p>
<p>Create an NLB server group and add the 3 control-plane node servers to it</p>
<p><img src="https://img2024.cnblogs.com/blog/35695/202509/35695-20250911220104381-477343480.png" alt="" width="700" loading="lazy"></p>
<p>Create a listener with protocol TCP and port 6443, and associate it with the server group created in the previous step.</p>
<p><img src="https://img2024.cnblogs.com/blog/35695/202508/35695-20250830141111385-1638859711.png" alt="" width="600" loading="lazy"></p>
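<p>The control-plane-endpoint hostname of the k8s cluster is kube-api, while the NLB endpoint hostname is the very long domain name below, so an internal DNS server is needed to map one to the other with a CNAME record</p>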
<pre><code class="language-text">nlb-uaohnyerknl7eraukw2.ap-southeast-1.nlb.aliyuncsslbintl.com
</code></pre>
<p>We chose Alibaba Cloud DNS PrivateZone: add a domain in the console, then add a CNAME record that resolves kube-api to the hostname bound to the Alibaba Cloud load balancer</p>
<p><img src="https://img2024.cnblogs.com/blog/35695/202508/35695-20250830220345705-1859090249.jpg" alt="Screenshot 2025-08-30 at 22.02" width="660" loading="lazy"></p>
<p>Log in to the kube-cp-01 server and test the resolution</p>
<pre><code class="language-shell">ping kube-api
PING nlb-uaohnyerknl7eraukw2.ap-southeast-1.com (172.21.49.53) 56(84) bytes of data.
64 bytes from 172.21.49.53: icmp_seq=1 ttl=102 time=0.378 ms
</code></pre>
<p>Resolution works; the load balancer deployment is complete.</p>
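<p>ping confirms that the name resolves, but not that port 6443 is being forwarded to an API server. A quick extra check could look like the sketch below. It assumes the API server's /readyz endpoint is reachable without credentials, and it uses curl -k, which skips certificate verification, so it is suitable only as a reachability probe. The helper name api_ready is hypothetical.</p>

```shell
# api_ready [URL]: succeed when the endpoint answers with the body "ok".
# -s silences progress output, -k skips TLS verification (probe only),
# --max-time bounds the wait if the load balancer is not forwarding.
api_ready() {
  local url="${1:-https://kube-api:6443/readyz}"
  [ "$(curl -sk --max-time 5 "$url")" = "ok" ]
}

# Usage from any server in the kube security group:
#   api_ready && echo "API server reachable through kube-api"
```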
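<h3 id="添加-worker-节点">Adding a Worker node</h3>
<p>Prepare a 4-core, 8 GB Alibaba Cloud ECS instance as the worker node, add it to the kube security group, set its hostname to kube-worker-01, and again create the system from the earlier image.</p>
<p>Log in to kube-worker-01 and delete the kube-api entry from /etc/hosts; the internal DNS deployed earlier will resolve the name automatically.</p>
<p>Reset the k8s configuration with kubeadm reset, then use kubeadm join to add this server to the cluster as a worker node</p>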
<pre><code class="language-shell">kubeadm join kube-api:6443 --token xxxxxx \
--discovery-token-ca-cert-hash sha256:yyyyyy
</code></pre>
<p>The following output means the node joined successfully</p>
<pre><code class="language-text">This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
</code></pre>
<p>Log in to one of the control-plane nodes and check the nodes in the cluster</p>
<pre><code class="language-shell">root@kube-cp-01 ~ # kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-cp-01 Ready control-plane 2d12h v1.33.4
kube-cp-02 Ready control-plane 6h45m v1.33.4
kube-cp-03 Ready control-plane 87m v1.33.4
kube-worker-01 Ready &lt;none&gt; 54m v1.33.4
</code></pre>
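<p>When more nodes are added later, reading the table by eye gets tedious. As a rough sketch, the node list can be filtered for anything whose STATUS is not exactly Ready. The helper not_ready_nodes is hypothetical; note that a cordoned node (for example Ready,SchedulingDisabled) would also be listed.</p>

```shell
# not_ready_nodes: read `kubectl get nodes` output on stdin and print the
# name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage on a control-plane node:
#   kubectl get nodes | not_ready_nodes
```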
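<p>All 3 control-plane nodes and the 1 worker node are running normally. The k8s cluster deployment is complete and the containers are ready. The next step is loading parcels into the containers (deploying pods), which we will share in later posts.</p>
<h3 id="搭建中遇到的问题">Problems encountered during the build</h3>
<p>We initially installed the latest Kubernetes release, 1.34, but when deploying the latest cilium, 1.18.1, we found that cilium was not compatible with k8s 1.34, so we had to switch to k8s 1.33.4.</p>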
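<p>A mismatch like this is cheaper to catch before installing than after. The sketch below shows one possible pre-flight check. All three helper names are hypothetical, and the maximum minor version of 1.33 used in the usage line comes only from the experience described above, not from an official compatibility matrix. It relies on sort -V from GNU coreutils.</p>

```shell
# minor_of VERSION: print the MAJOR.MINOR part of a version such as v1.33.4.
minor_of() {
  echo "$1" | sed -E 's/^v?([0-9]+)\.([0-9]+).*/\1.\2/'
}

# version_le A B: succeed when version A sorts at or before version B.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# k8s_supported K8S_VERSION MAX_MINOR: succeed when the Kubernetes minor
# version does not exceed the highest minor the CNI is believed to support.
k8s_supported() {
  version_le "$(minor_of "$1")" "$2"
}

# Usage (1.33 is an assumption drawn from this post's experience):
#   k8s_supported "$(kubeadm version -o short)" 1.33 || echo "check CNI compatibility first"
```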
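<p>We had also wanted to follow a blog post on the topic and try some of cilium's high tech: using a k8s service instead of the Alibaba Cloud load balancer as the load balancer for the control plane. The experiment failed, and we will look into it again when time permits.</p>
<h3 id="结语">Closing words</h3>
<p>Sorry to keep everyone waiting. This second post in the series is very late because we have been too busy lately; sometimes, midway through the build, we could not find time to continue for several days in a row.</p>
<p>Things are about to get busier. We recently reached a partnership with Huawei on promoting HarmonyOS and building a dedicated HarmonyOS zone, and we will be focused on building and operating that zone, so the posts in this series will be affected even more. We are considering sharing bit by bit, ant-moving-house style, for example one post on deploying redis and one on deploying dapr. Only once someone with operations talent joins the team to take charge of the HarmonyOS partnership can the overseas effort pick up pace.</p>
<p>Also, the Yunqi Developer Base (云栖开发者基地) next door to the Cnblogs office has finished renovation, so Cnblogs members in Hangzhou now have a fixed place to meet offline, and the overseas plan can be discussed offline too.</p><br><br>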
Source: https://www.cnblogs.com/cmt/p/19036702