第五章：运维 OceanBase 数据库 5.4 如何监控 OceanBase 数据库和配置告警

OceanBase社区传送门 2023-09-09

187

5.4 如何监控 OceanBase 数据库和配置告警

如何使用传统监控产品监控 OceanBase 数据库

OceanBase 数据库是单进程软件，通常情况下，其性能瓶颈首先不会是 IO，更可能是 CPU、内存和网络等。当然，IO 也会影响 OceanBase 数据库租户的读写性能。

在第二章部署中，我们介绍了 OceanBase 数据库软件的 IO 有三类：

运行日志
数据文件
事务日志

生产环境建议用三块独立的磁盘存储，或者至少使用三个独立的文件系统。

针对 OceanBase 数据库主机，建议部署如下监控。

监控项	描述	应对策略
CPU	监控 CPU 的 USER、SYS、IOWAIT 和整体利用率。	分析 CPU 利用率高的进程。如果是 observer 进程，则进一步分析 SQL。
LOAD	主机的负载，通常和 CPU 密切相关。
内存	监控剩余内存。	剩余内存通常很稳定，如果小于 1G，则进程 observer 或 obproxy 有 OOM 风险。
监听 2881/2882/2883 端口	2881 是 observer 连接端口，2883 是 obproxy 连接端口。	确认进程是否存活，查看进程运行日志，分析进程故障或监听失败原因。立即重启进程。
网络流量	监控流量是否打满网卡（万兆）	分析集群是否发生负载均衡，调低数据迁移的并发或者分析是否有大量数据抽取。
IO 吞吐量、利用率和延时	监控数据盘和日志盘的 IO 延时、吞吐量和 IO 利用率。	分析集群是否发生负载均衡，调低数据迁移的并发或者分析是否有大量数据抽取。分析是否坏盘。分析业务 SQL 的执行计划是否有问题等。

如何使用 Prometheus 监控 OceanBase 数据库

在第 2 章里介绍了 OceanBase 监控插件 OBAgent 如何部署。OBAgent 启动后会自动生成适合 Prometheus 系统的配置文件目录 prometheus_config。

Prometheus 安装部署

下载 Promethueus 软件
地址：https://prometheus.io/download/

解压缩安装

sudo tar zxvf prometheus-2.30.3.linux-amd64.tar.gz -C /usr/local/

# 复制 OBAgent 生成的 Prometheus 配置文件到 Prometheus 安装目录中。
sudo mv prometheus_config/ /usr/local/prometheus-2.30.3.linux-amd64/

Prometheus 服务文件

sudo mkdir /var/lib/prometheus
sudo vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus-2.30.3.linux-amd64/prometheus --config.file=/usr/local/prometheus-2.30.3.linux-amd64/prometheus_config/prometheus.yaml --storage.tsdb.path=/var/lib/prometheus --web.enable-lifecycle --web.external-url=http://172.20.249.54:9090

[Install]
WantedBy=multi-user.target

启动服务

sudo systemctl daemon-reload

sudo systemctl start prometheus

sudo systemctl status prometheus
[admin@obce00 ~]$ sudo systemctl status prometheus
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-10-21 15:54:42 CST; 49s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 902555 (prometheus)
    Tasks: 13 (limit: 195588)
   Memory: 40.6M
   CGroup: /system.slice/prometheus.service
           └─902555 /usr/local/prometheus-2.30.3.linux-amd64/prometheus --config.file=/usr/local/prometheus-2.30.3.linux-amd64/prometheus_config/prometheus.yaml --storage.tsdb.path=/var/lib/prometheus --web.enable-lifecycle --web.external-url=http://172.20.249.54:9090

Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.275Z caller=head.go:479 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.275Z caller=head.go:513 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=2.127µs
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.275Z caller=head.go:519 component=tsdb msg="Replaying WAL, this may take a while"
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.275Z caller=head.go:590 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.275Z caller=head.go:596 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=39.378µs wal_replay_duration=185.207µs total_replay_duration=242.438µs
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.277Z caller=main.go:849 fs_type=XFS_SUPER_MAGIC
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.277Z caller=main.go:852 msg="TSDB started"
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.277Z caller=main.go:979 msg="Loading configuration file" filename=/usr/local/prometheus-2.30.3.linux-amd64/prometheus_config/prometheus.yaml
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.281Z caller=main.go:1016 msg="Completed loading of configuration file" filename=/usr/local/prometheus-2.30.3.linux-amd64/prometheus_config/prometheus.yaml totalDuration=4.630509ms db_storage=1>
Oct 21 15:54:42 obce00 prometheus[902555]: level=info ts=2021-10-21T07:54:42.281Z caller=main.go:794 msg="Server is ready to receive web requests."
[admin@obce00 ~]$

确认 Prometheus 是否启动成功

sudo netstat -ntlp | grep 9090
[admin@obce00 ~]$ sudo netstat -ntlp | grep 9090
tcp6       0      0 :::9090                 :::*                    LISTEN      902555/prometheus

Prometheus 使用

使用浏览器测试：http://172.20.249.54:9090/alerts。

说明

此处链接中的 IP 为示例中配置 Prometheus 的服务器 IP，仅供参考。您需根据实际情况将其转换为自身配置 Prometheus 的服务器 IP。

查看告警事件

查看节点 LOAD

节点中会涉及到很多自定义的指标名，目前支持的指标名如下：

指标名	Ladel	描述	类型
node_cpu_seconds_total	cpu,mode,svr_ip	CPU 时间	counter
node_disk_read_bytes_total	device,svr_ip	磁盘读取字节数	counter
node_disk_read_time_seconds_total	device,svr_ip	磁盘读取消耗总时间	counter
node_disk_reads_completed_total	device,svr_ip	磁盘读取完成次数	counter
node_disk_written_bytes_total	device,svr_ip	磁盘写入字节数	counter
node_disk_write_time_seconds_total	device,svr_ip	磁盘写入消耗总时间	counter
node_disk_writes_completed_total	device,svr_ip	磁盘写入完成次数	counter
node_filesystem_avail_bytes	device,fstype,mountpoint,svr_ip	文件系统可用大小	gauge
node_filesystem_readonly	device,fstype,mountpoint,svr_ip	文件系统是否只读	gauge
node_filesystem_size_bytes	device,fstype,mountpoint,svr_ip	文件系统大小	gauge
node_load1	svr_ip	1 分钟平均 load	gauge
node_load5	svr_ip	5 分钟平均 load	gauge
node_load15	svr_ip	15 分钟平均 load	gauge
node_memory_Buffers_bytes	svr_ip	内存 buffer 大小	gauge
node_memory_Cached_bytes	svr_ip	内存 cache 大小	gauge
node_memory_MemFree_bytes	svr_ip	内存 free 大小	gauge
node_memory_MemTotal_bytes	svr_ip	内存总大小	gauge
node_network_receive_bytes_total	device,svr_ip	网络接受总字节数	counter
node_network_transmit_bytes_total	device,svr_ip	网络发送总字节数	counter
node_ntp_offset_seconds	svr_ip	NTP 时钟偏移

oceanbase

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者