Comunidad MariaDB

监控选型

Percona PMM / Prometheus + mysqld_exporter / Datadog / 云原生——按场景选

速查决策

场景推荐
自建生产PMMProm + mysqld_exporter
已用 DatadogDatadog MySQL Integration
云上托管云厂商内置 + Datadog 补充
小团队、要快PMM(一键、Dashboard 完整)
大公司、要灵活Prom + 自定义 dashboard

Percona PMM(推荐自建首选)

安装

# Server
docker run -d --name pmm-server -p 80:80 -p 443:443 \
  -v pmm-data:/srv \
  percona/pmm-server:2

# Client(在每个 DB 节点)
docker run -d --rm --name pmm-client \
  --network host \
  -e PMM_AGENT_SERVER_ADDRESS=pmm-host:443 \
  -e PMM_AGENT_SERVER_USERNAME=admin \
  -e PMM_AGENT_SERVER_PASSWORD=admin \
  -e PMM_AGENT_SETUP=1 \
  percona/pmm-client:2

# 注册 MariaDB
docker exec pmm-client pmm-admin add mysql \
  --query-source=perfschema \
  --username=monitor --password=xxx \
  --host=127.0.0.1 --port=3306 \
  mariadb-prod

包含什么

  • 实时性能仪表盘(QPS, latency, errors)
  • Query Analytics(最慢 query、最频繁 query)
  • InnoDB metrics
  • Replication lag
  • 主机指标(CPU、内存、磁盘)
  • 告警规则模板

优劣

✅ 一键安装、完整 dashboard ✅ Query Analytics 是杀手锏 ✅ 开源免费

❌ 自托管要资源(PMM Server ~ 2 vCPU + 4 GB) ❌ UI 不如 Datadog

Prometheus + mysqld_exporter

# exporter
docker run -d --name mysqld_exporter --network host \
  -e DATA_SOURCE_NAME='exporter:xxx@(127.0.0.1:3306)/' \
  prom/mysqld-exporter:latest \
  --collect.global_status \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.info_schema.replica_host \
  --collect.perf_schema.eventsstatements \
  --collect.perf_schema.eventswaits

Prometheus 配置:

scrape_configs:
  - job_name: mariadb
    static_configs:
      - targets: ['db1:9104', 'db2:9104']

Grafana dashboard:用 "MySQL Overview" (id=7362) 或 "MySQL Exporter Dashboard" (id=14057)。

优劣

✅ Prometheus 生态丰富 ✅ 自定义灵活 ✅ 与 K8s 集成好

❌ 初始配置工作量大 ❌ Query Analytics 需要额外组件(如 mysql-slow-exporter

Datadog MySQL Integration

# datadog.yaml 或 K8s annotation
init_config:
instances:
  - host: db1
    port: 3306
    user: datadog
    pass: xxx
    options:
      replication: true
      galera_cluster: false
      extra_status_metrics: true
      extra_innodb_metrics: true
      schema_size_metrics: true
      query_metrics:
        enabled: true

✅ UI 漂亮、告警强大、APM 集成 ✅ 多产品一站式 ❌ 贵

云厂商内置

  • AWS RDS → CloudWatch + Performance Insights
  • 阿里云 RDS → DAS (Database Autonomy Service)
  • 腾讯 TencentDB → DBbrain
  • SkySQL → 内置 dashboard

通常足够日常,但深入排查要外接专业工具。

必看的核心指标

流量与延迟

指标来源含义
Questionsglobal status总查询数
Queries (excluding non-SQL)global status实际 SQL 查询
Avg query latencyperf_schema平均延迟
P95 / P99 latencyperf_schema长尾
ConnectionsThreads_connected活跃连接数
ErrorsAborted_connects, Aborted_clients连接错误率

InnoDB

指标警戒
Buffer pool hit ratio< 99% 报警
Buffer pool wait free> 0 报警(IO 不够)
Row lock wait time上升趋势
Deadlocks增长率
History list length> 数万

复制

指标警戒
Seconds_Behind_Master> 10s
Slave_IO_Running!= Yes
Slave_SQL_Running!= Yes
Last_IO_Error非空

系统

  • CPU
  • Memory
  • Disk usage(> 80% 报警)
  • Disk IOPS / latency
  • Network throughput

告警规则

# Prometheus AlertManager
groups:
  - name: mariadb
    rules:
      - alert: HighConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m
        labels: { severity: warning }

      - alert: ReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 2m
        labels: { severity: critical }

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels: { severity: critical }

      - alert: SlowQueriesSpike
        expr: rate(mysql_global_status_slow_queries[5m]) > 5
        for: 5m
        labels: { severity: warning }

      - alert: BufferPoolHitLow
        expr: 1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])) < 0.99
        for: 10m
        labels: { severity: warning }

Query 性能监控

慢日志

slow_query_log = 1
long_query_time = 0.5
log_queries_not_using_indexes = 1

每天用 pt-query-digest 跑一次:

pt-query-digest /var/log/mariadb/slow.log | head -100 > daily-slow.txt

Performance Schema

-- Top 10 慢 query
SELECT digest_text, count_star, avg_timer_wait / 1e9 AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC
LIMIT 10;

PMM 和 Datadog 都自动消费 perf_schema,不用自己写。

日志聚合

把 MariaDB 的 error log / slow log / audit log 送到 Loki / ELK / Sentry:

# Promtail
scrape_configs:
  - job_name: mariadb
    static_configs:
      - targets: [localhost]
        labels:
          job: mariadb
          __path__: /var/log/mariadb/*.log

SLO 例子

- 数据库可用性: 99.95%/月
- P99 query latency < 50ms
- 复制延迟 < 5s
- 备份每日成功率: 100%
- 没有未审计的写操作

每月 review SLO 违反次数。

延伸

On this page