4. Prometheus Monitoring: Monitoring Linux Servers
<h3>1. Monitoring a Linux Server</h3><h4>1.1 Binary Installation</h4>
<div class="cnblogs_code">
<pre># On the client (the host to be monitored)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
ls -l
# mv needs the target directory to exist, so create it first
mkdir -p /opt/prometheus/node_exporter
mv node_exporter-1.7.0.linux-amd64/* /opt/prometheus/node_exporter
# Run the exporter as an unprivileged user with no home directory and no login shell
useradd -M -s /usr/sbin/nologin prometheus
chown prometheus:prometheus -R /opt/prometheus/node_exporter
</pre>
<div class="cnblogs_code">
<pre># Create the systemd service
cat > /etc/systemd/system/node_exporter.service <<"EOF"
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Start the service and enable it at boot
systemctl daemon-reload
systemctl start node_exporter.service
systemctl enable node_exporter.service
systemctl status node_exporter.service
# Follow the logs
journalctl -u node_exporter.service -f</pre>
</div>
<p>On the Prometheus server</p>
<div class="cnblogs_code">
<pre># Edit the Prometheus configuration
nano /opt/prometheus/prometheus/prometheus.yml
# Add the following job under the scrape_configs: key
# node-exporter job
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.10.14:9100']
        labels:
          instance: test server
# Reload Prometheus (this endpoint only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload</pre>
</div>
</div>
<h4>1.2 Installing the Client with Docker or Docker Compose</h4>
<div class="cnblogs_code">
<pre># Install with docker
# With --net="host" the exporter listens directly on host port 9100,
# so no -p port mapping is needed (Docker discards published ports in host mode)
docker run -d \
-v "/proc:/host/proc:ro" \
-v "/sys:/host/sys:ro" \
-v "/:/rootfs:ro" \
--net="host" \
prom/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs
# Install with docker-compose
# Note: --collector.filesystem.mount-points-exclude replaces the deprecated
# --collector.filesystem.ignored-mount-points flag (node_exporter 1.3.0 and later)
mkdir /data/node_exporter -p
cd /data/node_exporter
cat > docker-compose.yaml <<"EOF"
version: '3.3'
services:
  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    network_mode: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--web.listen-address=:9100'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - "--path.rootfs=/rootfs"
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
EOF
# Start and check
docker-compose up -d
docker ps
# Or follow the container logs:
docker logs -f node-exporter

# The metrics are then served at:
# http://192.168.10.100:9100/metrics</pre>
</div>
<p>Update prometheus.yml on the Prometheus server</p>
<div class="cnblogs_code">
<pre># Add the following under scrape_configs:
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: Prometheus server
      - targets: ['192.168.10.100:9100']
        labels:
          instance: test server
# Reload the server configuration
curl -X POST http://localhost:9090/-/reload</pre>
</div>
<p><img src="https://img2024.cnblogs.com/blog/1523753/202404/1523753-20240424154132797-1305542463.png"></p>
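<h3>2. Common Monitoring Metrics</h3>
<p>View them at: http://192.168.10.14:9090/graph</p>
<h4>2.1 CPU</h4>
<p>node_cpu_seconds_total: cumulative seconds each CPU has spent in each mode (user, system, idle, iowait, steal, and so on)</p>
<table border="0" align="left">
<tbody>
<tr>
<td>Name</td>
<td>Meaning</td>
</tr>
<tr>
<td>node_load1</td>
<td>System load average over the last 1 minute</td>
</tr>
<tr>
<td>node_load5</td>
<td>System load average over the last 5 minutes</td>
</tr>
<tr>
<td>node_load15</td>
<td>System load average over the last 15 minutes</td>
</tr>
</tbody>
</table>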
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p><img src="https://img2024.cnblogs.com/blog/1523753/202404/1523753-20240424155456773-1073674812.png"></p>
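<p>Because node_cpu_seconds_total is a cumulative counter, CPU usage is derived from how fast the idle counter grows. The following is a minimal Python sketch of that arithmetic, not node_exporter or Prometheus code; the function name <code>cpu_busy_percent</code> and the sample numbers are made up for illustration.</p>

```python
def cpu_busy_percent(idle_start, idle_end, elapsed_seconds):
    """Average busy percentage across cores from per-core idle counters.

    idle_start / idle_end hold the idle-mode seconds per core at two sample
    times, i.e. values of node_cpu_seconds_total{mode="idle"}.
    """
    if elapsed_seconds <= 0 or not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching per-core samples and a positive interval")
    # Per-core idle fraction = increase in idle seconds / wall-clock seconds.
    idle_fractions = [
        (end - start) / elapsed_seconds for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100.0 * (1.0 - avg_idle)


# Two cores sampled 60 s apart: core 0 idled for 45 s, core 1 for 15 s.
print(cpu_busy_percent([1000.0, 2000.0], [1045.0, 2015.0], 60.0))  # prints 50.0
```

<p>This mirrors the shape of a PromQL expression such as 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100, where the 2m window is an assumed value.</p>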
<h4>2.2 Memory (collected from /proc/meminfo)</h4>
<p>Metric prefix: node_memory_</p>
<p>node_memory_MemTotal_bytes: total memory</p>
<div class="lake-content">
<p id="u9b4e111a" class="ne-p"><span class="ne-text">node_memory_MemAvailable_bytes: memory available to new workloads without swapping, as estimated by the kernel (roughly free memory plus reclaimable buffers and cache)</span></p>
</div>
<div class="lake-content">
<p id="u8865aa33" class="ne-p"><span class="ne-text">node_memory_MemFree_bytes: completely unused physical memory</span></p>
</div>
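<p>To make the relationship between these metrics concrete, here is a small Python sketch that parses /proc/meminfo-style text and computes the percentage of available memory. The helper names and the sample text are hypothetical; node_exporter itself is written in Go and does this differently.</p>

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: bytes} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        amount = int(parts[0])
        # Most fields are reported in kB, where 1 kB means 1024 bytes.
        if len(parts) > 1 and parts[1] == "kB":
            amount *= 1024
        values[name.strip()] = amount
    return values


def mem_available_percent(meminfo):
    """Share of total memory that is still available, in percent."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"] * 100


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""

info = parse_meminfo(SAMPLE)
print(mem_available_percent(info))  # prints 25.0
```

<p>Note that MemFree is much smaller than MemAvailable in the sample, which is why alerting on MemAvailable is usually the more meaningful choice.</p>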
<h4>2.3 Disk</h4>
<p>Metric prefix: node_disk_</p>
<div class="lake-content">
<div class="lake-content">
<p class="ne-p"><span class="ne-text">node_disk_read_bytes_total </span><span class="ne-text">total bytes read from the disk since the system booted (a counter)</span></p>
</div>
</div>
<div class="lake-content">
<p class="ne-p"><strong><span class="ne-text">node_disk_written_bytes_total </span></strong><span class="ne-text">total bytes written to the disk since the system booted (a counter)</span></p>
</div>
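<p>Counters such as node_disk_read_bytes_total only become useful once they are turned into a per-second rate. The sketch below shows the basic idea in Python, including how a counter reset (for example after a reboot) can be handled. It is a simplified illustration under stated assumptions: the name <code>counter_rate</code> is hypothetical, and the real PromQL rate() additionally extrapolates to the edges of the range window.</p>

```python
def counter_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset: the value seen after the
    reset is counted as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# 6000 bytes read over 30 seconds.
print(counter_rate([(0, 0), (15, 3000), (30, 6000)]))  # prints 200.0
# The counter dropped from 1000 to 100, so only the 100 after the reset counts.
print(counter_rate([(0, 1000), (10, 100)]))  # prints 10.0
```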
<h4>2.4 Filesystem</h4>
<div class="lake-content">
<p class="ne-p"><span class="ne-text">node_filesystem_avail_bytes </span><span class="ne-text">space available to non-root users, in bytes </span><span class="ne-text">(/1024/1024 = MiB, /1024/1024/1024 = GiB)</span></p>
</div>
<div class="lake-content">
<p id="ua1133ac0" class="ne-p"><span class="ne-text">node_filesystem_size_bytes: total filesystem size, in bytes</span></p>
<div class="lake-content">
<p id="ucc57a99d" class="ne-p"><span class="ne-text">node_filesystem_files_free: number of free inodes</span></p>
<div class="lake-content">
<p id="uaa19793e" class="ne-p"><span class="ne-text">node_filesystem_files: total number of inodes</span></p>
<h4 class="ne-p"><span class="ne-text">2.5 Network</span></h4>
<p class="ne-p"><span class="ne-text">Metric prefix: node_network_</span></p>
<div class="lake-content">
<p class="ne-p"><span class="ne-text">node_network_transmit_bytes_total </span><span class="ne-text">cumulative bytes transmitted (a counter); </span><span class="ne-text">apply rate() to get bytes per second, then /1024/1024 = MiB/s</span></p>
<div class="lake-content"><span class="ne-text"><span class="ne-text">node_network_receive_bytes_total </span></span><span class="ne-text">cumulative bytes received (a counter); apply rate() to get bytes per second</span></div>
</div>
</div>
</div>
</div>
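<p>Network throughput is easy to misreport because MiB/s (mebibytes per second) and Mbit/s (megabits per second) differ by roughly a factor of eight. A short Python sketch of both conversions, starting from a rate in bytes per second; the function names are hypothetical.</p>

```python
def bytes_per_sec_to_mib_per_sec(rate_bytes):
    """Bytes per second to mebibytes per second (1 MiB = 1024 * 1024 bytes)."""
    return rate_bytes / 1024 / 1024


def bytes_per_sec_to_mbit_per_sec(rate_bytes):
    """Bytes per second to megabits per second (1 Mbit = 1,000,000 bits)."""
    return rate_bytes * 8 / 1_000_000


# A fully used 1 Gbit/s link carries 125,000,000 bytes per second.
print(bytes_per_sec_to_mbit_per_sec(125_000_000))  # prints 1000.0
# 100 MiB/s expressed in bytes per second and converted back.
print(bytes_per_sec_to_mib_per_sec(104_857_600))  # prints 100.0
```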
<h4>2.6 File Descriptors</h4>
<p>node_filefd_allocated: number of allocated file descriptors; check with cat /proc/sys/fs/file-nr<br>node_filefd_maximum: maximum number of file descriptors the system supports; check /proc/sys/fs/file-max or /proc/sys/fs/file-nr</p>
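<p>/proc/sys/fs/file-nr holds three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum. The following Python sketch reads such a line and computes the usage percentage that a file descriptor alert would compare against a threshold. The function name and the sample line are illustrative assumptions.</p>

```python
def filefd_usage_percent(file_nr_text):
    """Return (allocated, maximum, percent used) from /proc/sys/fs/file-nr text."""
    fields = file_nr_text.split()
    if len(fields) != 3:
        raise ValueError("expected three whitespace-separated fields")
    allocated, _unused, maximum = (int(field) for field in fields)
    return allocated, maximum, allocated / maximum * 100


# Made-up sample line in the file-nr layout.
print(filefd_usage_percent("2048\t0\t8192\n"))  # prints (2048, 8192, 25.0)
```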
<h4>2.7 Process File Descriptors</h4>
<p>process_max_fds: maximum number of file descriptors the process may open.<br>process_open_fds: number of file descriptors the node_exporter process currently has open; equivalent to ls /proc/$PID/fd 2>/dev/null | wc -l</p>
<h4>2.8 Sockets</h4>
<p>node_sockstat_sockets_used # sockets in use<br>node_sockstat_TCP_inuse # TCP sockets in use<br>node_sockstat_TCP_tw # TCP sockets in TIME_WAIT state</p>
<h4>2.9 TCP/UDP</h4>
<p>node_netstat_Tcp_CurrEstab # TCP connections in ESTABLISHED or CLOSE_WAIT state<br>node_netstat_Tcp_InSegs # TCP segments received (including those received in error)<br>node_netstat_Tcp_InErrs # TCP segments received in error (for example, bad checksums)<br>node_netstat_Tcp_OutSegs # TCP segments sent<br>node_netstat_Udp_InDatagrams # UDP datagrams received<br>node_netstat_Udp_InErrors # UDP datagrams received in error<br>node_netstat_Udp_OutDatagrams # UDP datagrams sent</p>
<h3>3. Alert Rules</h3>
<div class="cnblogs_code">
<pre># Append a rule group to the existing alert.yml (the file must already start with "groups:").
# rate(), deriv(), min_over_time() and predict_linear() require a range vector;
# the windows used below ([1m], [2m], [5m], [1h]) are assumed values, so adjust them to your scrape interval.
cd /data/docker-prometheus/
cat >> prometheus/alert.yml <<"EOF"
- name: node-exporter
  rules:
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory, instance: {{ $labels.instance }}"
      description: "Available memory < 10%, current value: {{ $value }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host under memory pressure, instance: {{ $labels.instance }}"
      description: "The node is under heavy memory pressure: high rate of major page faults, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
      description: "Inbound network traffic > 100 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
      description: "Outbound network traffic > 100 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
      description: "Disk reads > 50 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
      description: "Disk writes > 50 MiB/s, current value: {{ $value }}"
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space, instance: {{ $labels.instance }}"
      description: "Remaining disk space < 10%, current value: {{ $value }}"
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Disk will fill within 24 hours, instance: {{ $labels.instance }}"
      description: "At the current write rate the filesystem is predicted to run out of space within 24 hours, current value: {{ $value }}"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of inodes, instance: {{ $labels.instance }}"
      description: "Remaining inodes < 10%, current value: {{ $value }}"
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
      description: "Disk read latency > 100ms, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
      description: "Disk write latency > 100ms, current value: {{ $value }}"
  - alert: high_load
    expr: node_load1 > 4
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "1-minute load average too high, instance: {{ $labels.instance }}"
      description: "1-minute load average > 4 for 2 minutes, current value: {{ $value }}"
  # Renamed from HostCpuIsUnderUtilized: the expression fires on high usage, not low usage
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage, instance: {{ $labels.instance }}"
      description: "CPU usage > 80%, current value: {{ $value }}"
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal CPU steal, instance: {{ $labels.instance }}"
      description: "CPU steal > 10%. A noisy neighbor is hurting VM performance, or a spot instance may be out of credit, current value: {{ $value }}"
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Swap is filling up, instance: {{ $labels.instance }}"
      description: "Swap usage > 80%, current value: {{ $value }}"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Network receive errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} had a receive error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Network transmit errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} had a transmit error rate above 0.01 over the last 2 minutes, current value: {{ $value }}"
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Network interface saturated, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} is getting overloaded, current value: {{ $value }}"
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Conntrack table nearly full, instance: {{ $labels.instance }}"
      description: "Connection-tracking table usage > 80%, current ratio: {{ $value }}"
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock skew detected, instance: {{ $labels.instance }}"
      description: "Clock skew detected, the clock is out of sync. Value: {{ $value }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock not synchronising, instance: {{ $labels.instance }}"
      description: "The clock is not synchronising"
  - alert: NodeFileDescriptorLimit
    expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "The kernel is predicted to exhaust its file descriptor limit soon"
      description: "{{ $labels.instance }} has allocated more than 80% of the file descriptor limit, current value: {{ $value }}"
  # Fires when more than 40000 TCP connections are established
  - alert: HostTooManyTcpConnections
    expr: node_netstat_Tcp_CurrEstab > 40000
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Too many TCP connections on host, instance: {{ $labels.instance }}"
      description: "Current value: {{ $value }}"
EOF</pre>
</div>
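<p>The HostDiskWillFillIn24Hours rule relies on predict_linear(), which fits a straight line through recent samples and extrapolates it forward. The Python sketch below shows that idea with an ordinary least-squares fit. It is a simplified illustration, not the Prometheus implementation: the function here predicts relative to the last sample rather than the evaluation time, and the sample data are made up.</p>

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation of (timestamp, value) samples.

    Returns the value predicted `seconds_ahead` seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Free space (in GiB) sampled every 30 minutes, shrinking by 20 GiB per hour.
history = [(0, 100.0), (1800, 90.0), (3600, 80.0)]
predicted = predict_linear(history, 24 * 3600)
# A negative prediction means the filesystem would be full before 24 hours pass.
print(predicted < 0)  # prints True
```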
<p>Check the configuration and the rule files it references for errors:</p>
<div class="cnblogs_code">
<pre>docker exec -it prometheus promtool check config /etc/prometheus/prometheus.yml
# Reload the configuration
curl -X POST http://localhost:9090/-/reload</pre>
</div>
<p>Open http://192.168.10.14:9090/alerts?search= and the newly defined alerts appear in the list.</p>
<p><img src="https://img2024.cnblogs.com/blog/1523753/202404/1523753-20240424160057845-1275063421.png"></p>
<h3>4. Displaying node-exporter Data in Grafana</h3>
<p>Dashboard template 1860 has already been added to Grafana, so you can log in and view it directly:</p>
<p>http://192.168.10.14:3000/login</p>
<p><img src="https://img2024.cnblogs.com/blog/1523753/202404/1523753-20240424160442639-1429058656.png"></p>
Source: https://www.cnblogs.com/yangmeichong/p/18155650