暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

揭秘Zabbix告警无数据问题

数据库驾驶舱 2024-03-12
194

背景:

数据库运维如同一场无休止的博弈,而Zabbix作为监控领域的佼佼者,时刻保持警惕是我们保障系统稳健的关键。然而,最近我们在一台闲置备库上收到这样一个告警:"Zabbix agent nodata more than 10min on oracle数据库db01"。

问题排查:

由于这台Oracle是备库,没有应用连接,所以平时负载也不高,排除了因为高负载导致zabbix无数据的告警。

再查看zabbix日志发现,原来是执行oracle监控脚本的时候超时导致了这个告警。

    26791:20240312:134558.153 Failed to execute command "/usr/local/zabbix/share/externalscripts/oracle_monitor.sh gap": Timeout while executing a shell script.

    手动执行脚本发现主要慢在MERGE JOIN CARTESIAN这一步 执行计划如下:

      SQL Monitoring Report


      SQL Text
      ------------------------------
      select count(*) from gv$archive_gap where exists (select 1 from v$database where database_role ='PHYSICAL STANDBY')


      Global Information
      ------------------------------
      Status : DONE (ALL ROWS)
      Instance ID : 1
      Session : SYS (576:56157)
      SQL ID : 9kph71w0hjrys
      SQL Execution ID : 17460592
      Execution Started : 03/12/2024 13:47:52
      First Refresh Time : 03/12/2024 13:47:58
      Last Refresh Time : 03/12/2024 13:48:15
      Duration : 23s
      Module/Action : sqlplus@db01 (TNS V1-V3)/-
      Service : SYS$USERS
      Program : sqlplus@db01 (TNS V1-V3)
      Fetch Calls : 1


      Global Stats
      =======================================================
      | Elapsed | Cpu | Other | Fetch | Read | Read |
      | Time(s) | Time(s) | Waits(s) | Calls | Reqs | Bytes |
      =======================================================
      | 23 | 23 | 0.00 | 1 | 131 | 18MB |
      =======================================================


      SQL Plan Monitoring Details (Plan Hash Value=180725793)
      =====================================================================================================================================================
      | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Mem | Activity | Activity Detail |
      | | | | (Estim) | | Active(s) | Active | | (Actual) | (Max) | (%) | (# samples) |
      =====================================================================================================================================================
      | 0 | SELECT STATEMENT | | | | 1 | +23 | 1 | 1 | | | |
      | 1 | SORT AGGREGATE | | 1 | | 1 | +23 | 1 | 1 | | | |
      | 2 | FILTER | | | | | | 1 | | | | |
      | 3 | FILTER | | | | | | 1 | | | | |
      | 4 | HASH JOIN OUTER | | 5 | 4 | | | 1 | | 212K | | |
      | 5 | HASH JOIN | | 5 | 3 | | | 1 | | 209K | | |
      | 6 | VIEW | | 1 | 2 | | | 1 | | | | |
      | 7 | FILTER | | | | | | 1 | | | | |
      | 8 | HASH GROUP BY | | 1 | 2 | | | 1 | | | | |
      | 9 | HASH JOIN | | 1 | 1 | 18 | +6 | 1 | 0 | 2M | | |
      | 10 | FIXED TABLE FULL | X$KCCAL | 1 | | 1 | +6 | 1 | 14336 | | | |
      | 11 | VIEW | | 1 | 1 | 1 | +23 | 1 | 1 | | | |
      | 12 | HASH GROUP BY | | 1 | 1 | 23 | +1 | 1 | 1 | 4M | 60.87 | Cpu (14) |
      | 13 | FILTER | | | | 18 | +6 | 1 | 87M | | 4.35 | Cpu (1) |
      | 14 | HASH JOIN | | 1 | | 22 | +2 | 1 | 87M | 2M | 34.78 | Cpu (8) |
      | 15 | FIXED TABLE FULL | X$KCCLH | 100 | | 1 | +6 | 1 | 9344 | | | |
      | 16 | MERGE JOIN CARTESIAN | | 100 | | 18 | +6 | 1 | 9344 | | | |
      | 17 | FIXED TABLE FULL | X$KCCIC | 1 | | 17 | +6 | 1 | 1 | | | |
      | 18 | BUFFER SORT | | 100 | | 18 | +6 | 1 | 9344 | 628K | | |
      | 19 | FIXED TABLE FULL | X$KCCLH | 100 | | 1 | +6 | 1 | 9344 | | | |
      | 20 | VIEW | | 5 | 1 | | | | | | | |
      | 21 | HASH GROUP BY | | 5 | 1 | | | | | | | |
      | 22 | FIXED TABLE FULL | X$KCCLH | 5 | | | | | | | | |
      | 23 | SORT AGGREGATE | | 1 | | | | | | | | |
      | 24 | FIXED TABLE FULL | X$KCCFE | 1 | | | | | | | | |
      | 25 | VIEW | | 1 | 1 | | | | | | | |
      | 26 | HASH GROUP BY | | 1 | 1 | | | | | | | |
      | 27 | FILTER | | | | | | | | | | |
      | 28 | HASH JOIN | | 1 | | | | | | | | |
      | 29 | NESTED LOOPS | | 1 | | | | | | | | |
      | 30 | FIXED TABLE FULL | X$KCCFE | 1 | | | | | | | | |
      | 31 | FIXED TABLE FULL | X$KCCLH | 1 | | | | | | | | |
      | 32 | FIXED TABLE FULL | X$KCCIC | 1 | | | | | | | | |
      | 33 | MERGE JOIN CARTESIAN | | 1 | | 1 | +6 | 1 | 1 | | | |
      | 34 | FIXED TABLE FULL | X$KCCDI | 1 | | 1 | +6 | 1 | 1 | | | |
      | 35 | BUFFER SORT | | 100 | | 1 | +6 | 1 | 1 | | | |
      | 36 | FIXED TABLE FULL | X$KCCDI2 | 100 | | 1 | +6 | 1 | 1 | | | |
      =====================================================================================================================================================

      为了解决性能问题,我们查阅了MOS文章,发现了下面这篇文章

      Bug 18411339 - Low performance or ORA-1220 for query on V$ARCHIVE_GAP on 11.2.0.4 or later release (Doc ID 18411339.8)

      Description

      Low performance or ORA-1220 thrown for query on V$ARCHIVE_GAP on 11.2.0.4 or later release

      due to execution plan on V$ARCHIVE_GAP using MERGE JOIN CARTESIAN internally involving  

      X$KCCLH, X$KCCLH, X$KCCAL.

      解决办法1: 重构现有的监控脚本,用下面脚本监控dg gap

        select USERENV('Instance'), high.thread#, low.lsq, high.hsq
        from
        (select a.thread#, rcvsq, min(a.sequence#)-1 hsq
        from v$archived_log a,
        (select lh.thread#, lh.resetlogs_change#, max(lh.sequence#) rcvsq
        from v$log_history lh, v$database_incarnation di
        where lh.resetlogs_time = di.resetlogs_time
        and lh.resetlogs_change# = di.resetlogs_change#
        and di.status = 'CURRENT'
        and lh.thread# is not null
        and lh.resetlogs_change# is not null
        and lh.resetlogs_time is not null
        group by lh.thread#, lh.resetlogs_change#
        ) b
        where a.thread# = b.thread#
        and a.resetlogs_change# = b.resetlogs_change#
        and a.sequence# > rcvsq
        group by a.thread#, rcvsq) high,
        (select srl_lsq.thread#, nvl(lh_lsq.lsq, srl_lsq.lsq) lsq
        from
        (select thread#, min(sequence#)+1 lsq
        from
        v$log_history lh, x$kccfe fe, v$database_incarnation di
        where to_number(fe.fecps) <= lh.next_change#
        and to_number(fe.fecps) >= lh.first_change#
        and fe.fedup!=0 and bitand(fe.festa, 12) = 12
        and di.resetlogs_time = lh.resetlogs_time
        and lh.resetlogs_change# = di.resetlogs_change#
        and di.status = 'CURRENT'
        group by thread#) lh_lsq,
        (select thread#, max(sequence#)+1 lsq
        from
        v$log_history
        where (select min( to_number(fe.fecps))
        from x$kccfe fe
        where fe.fedup!=0 and bitand(fe.festa, 12) = 12)
        >= next_change#
        group by thread#) srl_lsq
        where srl_lsq.thread# = lh_lsq.thread#(+)
        ) low
        where low.thread# = high.thread#
        and lsq < = hsq
        and hsq > rcvsq;

        解决办法2:从zabbix角度处理问题。

          zabbix-agent.confTimeout的值改为30

          关注公众号,深入技术细节,共享数据库运维心得。

          文章转载自数据库驾驶舱,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

          评论