监控选型
Percona PMM / Prometheus + mysqld_exporter / Datadog / 云原生——按场景选
速查决策
| 场景 | 推荐 |
|---|---|
| 自建生产 | PMM 或 Prom + mysqld_exporter |
| 已用 Datadog | Datadog MySQL Integration |
| 云上托管 | 云厂商内置 + Datadog 补充 |
| 小团队、要快 | PMM(一键、Dashboard 完整) |
| 大公司、要灵活 | Prom + 自定义 dashboard |
Percona PMM(推荐自建首选)
安装
# Server
docker run -d --name pmm-server -p 80:80 -p 443:443 \
-v pmm-data:/srv \
percona/pmm-server:2
# Client(在每个 DB 节点)
docker run -d --rm --name pmm-client \
--network host \
-e PMM_AGENT_SERVER_ADDRESS=pmm-host:443 \
-e PMM_AGENT_SERVER_USERNAME=admin \
-e PMM_AGENT_SERVER_PASSWORD=admin \
-e PMM_AGENT_SETUP=1 \
percona/pmm-client:2
# 注册 MariaDB
docker exec pmm-client pmm-admin add mysql \
--query-source=perfschema \
--username=monitor --password=xxx \
--host=127.0.0.1 --port=3306 \
mariadb-prod包含什么
- 实时性能仪表盘(QPS, latency, errors)
- Query Analytics(最慢 query、最频繁 query)
- InnoDB metrics
- Replication lag
- 主机指标(CPU、内存、磁盘)
- 告警规则模板
优劣
✅ 一键安装、完整 dashboard ✅ Query Analytics 是杀手锏 ✅ 开源免费
❌ 自托管要资源(PMM Server ~ 2 vCPU + 4 GB) ❌ UI 不如 Datadog
Prometheus + mysqld_exporter
# exporter
docker run -d --name mysqld_exporter --network host \
-e DATA_SOURCE_NAME='exporter:xxx@(127.0.0.1:3306)/' \
prom/mysqld-exporter:latest \
--collect.global_status \
--collect.info_schema.innodb_metrics \
--collect.info_schema.processlist \
--collect.info_schema.replica_host \
--collect.perf_schema.eventsstatements \
--collect.perf_schema.eventswaitsPrometheus 配置:
scrape_configs:
- job_name: mariadb
static_configs:
- targets: ['db1:9104', 'db2:9104']Grafana dashboard:用 "MySQL Overview" (id=7362) 或 "MySQL Exporter Dashboard" (id=14057)。
优劣
✅ Prometheus 生态丰富 ✅ 自定义灵活 ✅ 与 K8s 集成好
❌ 初始配置工作量大
❌ Query Analytics 需要额外组件(如 mysql-slow-exporter)
Datadog MySQL Integration
# datadog.yaml 或 K8s annotation
init_config:
instances:
- host: db1
port: 3306
user: datadog
pass: xxx
options:
replication: true
galera_cluster: false
extra_status_metrics: true
extra_innodb_metrics: true
schema_size_metrics: true
query_metrics:
enabled: true✅ UI 漂亮、告警强大、APM 集成 ✅ 多产品一站式 ❌ 贵
云厂商内置
- AWS RDS → CloudWatch + Performance Insights
- 阿里云 RDS → DAS (Database Autonomy Service)
- 腾讯 TencentDB → DBbrain
- SkySQL → 内置 dashboard
通常足够日常,但深入排查要外接专业工具。
必看的核心指标
流量与延迟
| 指标 | 来源 | 含义 |
|---|---|---|
Questions | global status | 总查询数 |
Queries (excluding non-SQL) | global status | 实际 SQL 查询 |
| Avg query latency | perf_schema | 平均延迟 |
| P95 / P99 latency | perf_schema | 长尾 |
| Connections | Threads_connected | 活跃连接数 |
| Errors | Aborted_connects, Aborted_clients | 连接错误率 |
InnoDB
| 指标 | 警戒 |
|---|---|
| Buffer pool hit ratio | < 99% 报警 |
| Buffer pool wait free | > 0 报警(IO 不够) |
| Row lock wait time | 上升趋势 |
| Deadlocks | 增长率 |
| History list length | > 数万 |
复制
| 指标 | 警戒 |
|---|---|
Seconds_Behind_Master | > 10s |
Slave_IO_Running | != Yes |
Slave_SQL_Running | != Yes |
Last_IO_Error | 非空 |
系统
- CPU
- Memory
- Disk usage(> 80% 报警)
- Disk IOPS / latency
- Network throughput
告警规则
# Prometheus AlertManager
groups:
- name: mariadb
rules:
- alert: HighConnections
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
for: 5m
labels: { severity: warning }
- alert: ReplicationLag
expr: mysql_slave_status_seconds_behind_master > 30
for: 2m
labels: { severity: critical }
- alert: DiskFull
expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes) < 0.1
for: 5m
labels: { severity: critical }
- alert: SlowQueriesSpike
expr: rate(mysql_global_status_slow_queries[5m]) > 5
for: 5m
labels: { severity: warning }
- alert: BufferPoolHitLow
expr: 1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])) < 0.99
for: 10m
labels: { severity: warning }Query 性能监控
慢日志
slow_query_log = 1
long_query_time = 0.5
log_queries_not_using_indexes = 1每天用 pt-query-digest 跑一次:
pt-query-digest /var/log/mariadb/slow.log | head -100 > daily-slow.txtPerformance Schema
-- Top 10 慢 query
SELECT digest_text, count_star, avg_timer_wait / 1e9 AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC
LIMIT 10;PMM 和 Datadog 都自动消费 perf_schema,不用自己写。
日志聚合
把 MariaDB 的 error log / slow log / audit log 送到 Loki / ELK / Sentry:
# Promtail
scrape_configs:
- job_name: mariadb
static_configs:
- targets: [localhost]
labels:
job: mariadb
__path__: /var/log/mariadb/*.logSLO 例子
- 数据库可用性: 99.95%/月
- P99 query latency < 50ms
- 复制延迟 < 5s
- 备份每日成功率: 100%
- 没有未审计的写操作每月 review SLO 违反次数。