Overview: since resources were limited, this lab uses two machines.
- Monitoring host: Prometheus, Grafana, Alertmanager
- Monitored host: node_exporter, mysqld_exporter
I. Deploy Prometheus
1. Download
https://prometheus.io/download/
2. Extract
mkdir -p /data/prometheus
tar -zxvf /root/prometheus-2.42.0.linux-amd64.tar.gz -C /data/
cd /data
mv prometheus-2.42.0.linux-amd64/ prometheus
3. Set up
- Create the prometheus user and data directory
useradd -s /sbin/nologin -M prometheus
mkdir -p /data/database/prometheus
chown -R prometheus:prometheus /data/database/prometheus/
- Create the systemd unit
vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/data/prometheus/prometheus --web.enable-lifecycle --config.file=/data/prometheus/prometheus.yml --storage.tsdb.path=/data/database/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
4. Reload systemd & start the service
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
systemctl enable prometheus
- Open the web UI at IP:9090
- View the collected metrics at IP:9090/metrics
II. Monitor a Linux host
1. Download node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
2. Extract
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz -C /data/
mv /data/node_exporter-1.5.0.linux-amd64/ /data/node_exporter
3. Create the systemd unit
vim /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
ExecStart=/data/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
4. Reload systemd & start the service
systemctl daemon-reload
systemctl start node_exporter.service
systemctl status node_exporter.service
systemctl enable node_exporter.service
- View the exported metrics at IP:9100/metrics
5. Configure the monitoring host
- Append the following job to the end of the main configuration file:
vim /data/prometheus/prometheus.yml
  - job_name: 'agent1'                    # job name identifying the monitored machine
    static_configs:
      - targets: ['192.168.1.1:9100']     # replace with the monitored machine's IP; the port is 9100
- Check prometheus.yml for syntax errors
[root@VM-16-2-centos prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: prometheus.yml is valid prometheus config file syntax
6. Reload the Prometheus configuration
- Reload the running Prometheus without a restart (this works because the unit was started with --web.enable-lifecycle):
curl -X POST http://127.0.0.1:9090/-/reload
- Open the Prometheus web UI and run the query up to confirm the new target is reporting data.
- Back in the web UI, click Status -> Targets: the new monitoring target appears in the list.
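A few handy expressions for the query box at this point (PromQL; the job name matches the config added above):

```
up                   # 1 = target is up, 0 = target is down
up{job="agent1"}     # only the host added above
up == 0              # every target that is currently down
```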
III. Monitor MySQL
1. Download mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
2. Extract
tar -zxvf mysqld_exporter-0.14.0.linux-amd64.tar.gz -C /data/
mv /data/mysqld_exporter-0.14.0.linux-amd64/ /data/mysqld_exporter
[root@VM-12-2-centos ~]# ls /data/mysqld_exporter/
LICENSE  mysqld_exporter  NOTICE
3. Install MariaDB and grant monitoring privileges
yum -y install mariadb-server
systemctl start mariadb
[root@VM-12-2-centos ~]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

MariaDB [(none)]> grant select,replication client,process ON *.* to 'mysql_monito'@'localhost' identified by '123';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> exit
Bye
4. Start
nohup /data/mysqld_exporter/mysqld_exporter --config.my-cnf=/data/mysqld_exporter/.my.cnf &
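mysqld_exporter reads its MySQL credentials from the file passed via --config.my-cnf; that file is not shown in the original, so here is a minimal sketch matching the GRANT statement from step 3 (path and values assumed from this walkthrough):

```ini
# .my.cnf — credentials for the monitoring account, at the path passed via --config.my-cnf
[client]
user=mysql_monito
password=123
```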
5. Configure the monitoring host
vim /data/prometheus/prometheus.yml
  - job_name: 'mysql'                     # job name identifying the monitored machine
    static_configs:
      - targets: ['192.168.1.1:9104']     # replace with the monitored machine's IP; the port is 9104
6. Restart Prometheus
systemctl restart prometheus
- Back in the web UI, click Status -> Targets: the new monitoring target appears in the list.
IV. Deploy Grafana
1. Download
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.3.6.linux-amd64.tar.gz
2. Extract
tar -zxvf grafana-enterprise-9.3.6.linux-amd64.tar.gz -C /data
cd /data
mv grafana-9.3.6/ grafana
3. Edit the default configuration
- Back up
cp /data/grafana/conf/defaults.ini /data/grafana/conf/defaults.ini.bak
- Edit
vim /data/grafana/conf/defaults.ini
data = /data/database/grafana/data
logs = /data/database/grafana/log
plugins = /data/database/grafana/plugins
provisioning = /data/grafana/conf/provisioning/
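As an alternative to editing defaults.ini directly, Grafana also reads conf/custom.ini as an override file, which keeps your changes separate from the shipped defaults. A sketch with the same paths (these keys live under the [paths] section):

```ini
; /data/grafana/conf/custom.ini — overrides values in defaults.ini
[paths]
data = /data/database/grafana/data
logs = /data/database/grafana/log
plugins = /data/database/grafana/plugins
provisioning = /data/grafana/conf/provisioning/
```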
4. Create the systemd unit
- The unit below runs as the grafana user, so create it and the data directories first:
useradd -s /sbin/nologin -M grafana
mkdir -p /data/database/grafana
chown -R grafana:grafana /data/database/grafana
vim /etc/systemd/system/grafana-server.service
[Unit]
Description=Grafana
After=network.target

[Service]
User=grafana
Group=grafana
Type=notify
ExecStart=/data/grafana/bin/grafana-server -homepath /data/grafana/
Restart=on-failure

[Install]
WantedBy=multi-user.target
5. Reload systemd & start the service
systemctl daemon-reload
systemctl start grafana-server.service
systemctl status grafana-server.service
systemctl enable grafana-server.service
- Web UI: IP:3000
- The default credentials are admin / admin; you must change the password on first login.
6. Configure Grafana
- Add Prometheus as a data source to link Grafana and Prometheus.
- Click: Settings -> Data Sources -> Add data source -> select Prometheus -> set the URL to http://IP:9090 -> Save & test
- Click: the "+" in the left Dashboards menu -> Import -> enter "8919" -> Load -> change the name to "Prometheus Node" -> for the victoriaMetrics field, select the "prometheus" data source just created
- For other dashboard templates, browse the Grafana site: https://grafana.com/dashboards
- When done, go to "Dashboards" -> "victoriaMetrics" -> "Prometheus Node"
V. Deploy Alertmanager
1. Download
https://prometheus.io/download/
2. Extract
tar -zxvf alertmanager-0.25.0.linux-amd64.tar.gz -C /data/
cd /data
mv alertmanager-0.25.0.linux-amd64/ alertmanager
mkdir -p /data/alertmanager/data
chown -R prometheus:prometheus /data/alertmanager
3. Configure the Alertmanager service
vim /data/alertmanager/alertmanager.yml
(default configuration as shipped)
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
4. Create the systemd unit
vim /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/data/alertmanager/alertmanager --config.file=/data/alertmanager/alertmanager.yml --storage.path=/data/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
5. Reload systemd & start the service
systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service
systemctl enable alertmanager.service
6. Configure prometheus.yml
- Back up
cp /data/prometheus/prometheus.yml /data/prometheus/prometheus.yml.bak
- Edit
vim /data/prometheus/prometheus.yml
(add one target entry under the job for each machine to monitor)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.1.1:9093

rule_files:
  - "/data/database/prometheus/rules/*.rules"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['192.168.1.1:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.2:9100']
      - targets: ['192.168.1.3:9100']
      - targets: ['192.168.1.4:9100']
- Check prometheus.yml for errors (this also validates the referenced rules files)
cd /data/prometheus
./promtool check config prometheus.yml
[root@VM-16-2-centos prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 1 rule files found
  SUCCESS: prometheus.yml is valid prometheus config file syntax

Checking /data/database/prometheus/rules/node.rules
  SUCCESS: 21 rules found
7. Create the Prometheus rules file
mkdir /data/database/prometheus/rules
vim /data/database/prometheus/rules/node.rules
groups:
  - name: Node-rules
    rules:
      - alert: Node-Down
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} has been down for 1 minute"
          description: "Node is down"

      - alert: Node-CpuHigh
        expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100 > 80
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} CPU usage above 80%"
          description: "CPU usage is {{ $value }}%"

      - alert: Node-CpuIowaitHigh
        expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100 > 80
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} CPU iowait above 80%"
          description: "CPU iowait is {{ $value }}%"

      - alert: Node-MemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100 > 80
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} memory usage above 80%"
          description: "Memory usage is {{ $value }}%"

      - alert: Node-Load5High
        expr: node_load5 > (count by (instance) (node_cpu_seconds_total{job="node",mode="system"})) * 1.2
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} load(5m) exceeds 1.2x the CPU core count"
          description: "Load(5m) is {{ $value }}, above 1.2x the CPU core count"

      - alert: Node-DiskRootHigh
        expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint="/"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk usage on / above 80%"
          description: "Disk usage on / is {{ $value }}%"

      - alert: Node-DiskDataHigh
        expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint="/data"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint="/data"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk usage on /data above 80%"
          description: "Disk usage on /data is {{ $value }}%"

      - alert: Node-DiskReadHigh
        expr: irate(node_disk_read_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk read rate above 20 MB/s"
          description: "Disk read rate is {{ $value }} bytes/s"

      - alert: Node-DiskWriteHigh
        expr: irate(node_disk_written_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk write rate above 20 MB/s"
          description: "Disk write rate is {{ $value }} bytes/s"

      - alert: Node-DiskReadRateCountHigh
        expr: irate(node_disk_reads_completed_total{job="node"}[5m]) > 3000
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk read IOPS above 3000"
          description: "Disk read IOPS is {{ $value }}"

      - alert: Node-DiskWriteRateCountHigh
        expr: irate(node_disk_writes_completed_total{job="node"}[5m]) > 3000
        for: 1m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} disk write IOPS above 3000"
          description: "Disk write IOPS is {{ $value }}"

      - alert: Node-InodeRootUsedPercentHigh
        expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} inode usage on / above 80%"
          description: "Inode usage on / is {{ $value }}%"

      - alert: Node-InodeDataUsedPercentHigh
        expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/data"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/data"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} inode usage on /data above 80%"
          description: "Inode usage on /data is {{ $value }}%"

      - alert: Node-FilefdAllocatedPercentHigh
        expr: node_filefd_allocated{job="node"} / node_filefd_maximum{job="node"} * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} open file descriptors above 80% of the limit"
          description: "Open file descriptors are at {{ $value }}% of the limit"

      - alert: Node-NetworkNetinBitRateHigh
        expr: avg by (instance) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} network receive rate above 20 MB/s"
          description: "Network receive rate is {{ $value }} bit/s"

      - alert: Node-NetworkNetoutBitRateHigh
        expr: avg by (instance) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} network transmit rate above 20 MB/s"
          description: "Network transmit rate is {{ $value }} bit/s"

      - alert: Node-NetworkNetinPacketErrorRateHigh
        expr: avg by (instance) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} network receive error rate above 15 packets/s"
          description: "Network receive error rate is {{ $value }} packets/s"

      - alert: Node-NetworkNetoutPacketErrorRateHigh
        expr: avg by (instance) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} network transmit error rate above 15 packets/s"
          description: "Network transmit error rate is {{ $value }} packets/s"

      - alert: Node-ProcessBlockedHigh
        expr: node_procs_blocked{job="node"} > 10
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} more than 10 blocked tasks"
          description: "Blocked tasks: {{ $value }}"

      - alert: Node-TimeOffsetHigh
        expr: abs(node_timex_offset_seconds{job="node"}) > 3 * 60
        for: 2m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} clock offset above 3 minutes"
          description: "Clock offset is {{ $value }}s"

      - alert: Node-TCPconnection
        expr: node_sockstat_TCP_tw{job="node"} > 15000
        for: 2m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "{{ $labels.instance }} TCP TIME_WAIT connections above 15000"
          description: "TCP TIME_WAIT connections: {{ $value }}"
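As a sanity check on the Node-CpuHigh arithmetic, the same (1 - idle) * 100 computation can be reproduced in plain shell, with no Prometheus needed (the 0.15 idle fraction is an assumed sample value):

```shell
# Reproduce the Node-CpuHigh expression for a sample idle fraction:
# an average idle rate of 0.15 over the window means 85% CPU usage.
idle=0.15
usage=$(awk -v idle="$idle" 'BEGIN { printf "%d", (1 - idle) * 100 }')
echo "cpu usage: ${usage}%"
if [ "$usage" -gt 80 ]; then
  echo "Node-CpuHigh would fire"   # 85 > 80, so the alert condition holds
fi
```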
8. Configure Alertmanager email alerting
vim /data/alertmanager/alertmanager.yml
# Global configuration
global:
  resolve_timeout: 5m                     # how long to wait before marking an alert resolved; default 5m
  smtp_smarthost: 'smtp.qq.com:465'       # SMTP server
  smtp_from: '[email protected]'             # sender address
  smtp_auth_username: '[email protected]'    # account name
  smtp_auth_password: 'asdklfjwiehrqc'    # SMTP authorization code
  smtp_require_tls: false
  smtp_hello: 'qq.com'
# Alert templates
templates:
  - '/data/alertmanager/email.tmpl'

# Routing tree
route:
  group_by: ['alertname']     # label used to group alerts
  group_wait: 10s             # initial wait before sending the first notification for a group
  group_interval: 10s         # wait before sending notifications about new alerts added to a group
  repeat_interval: 10m        # how often an unresolved alert is resent; for email, do not set this too low, or the SMTP server may reject the flood of messages
  receiver: 'email'           # name of the receiver, matching a receivers name defined below
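The template file /data/alertmanager/email.tmpl referenced above is not shown in the original; a minimal sketch of such a Go-template file follows (the define name "email.to.html" is an assumption and must match the html field of the email receiver's email_configs):

```
{{ define "email.to.html" }}
{{ range .Alerts }}
Alert:       {{ .Labels.alertname }}<br>
Instance:    {{ .Labels.instance }}<br>
Severity:    {{ .Labels.severity }}<br>
Summary:     {{ .Annotations.summary }}<br>
Description: {{ .Annotations.description }}<br>
Started:     {{ .StartsAt.Format "2006-01-02 15:04:05" }}<br>
{{ end }}
{{ end }}
```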