背景:
数据库运维如同一场无休止的博弈,而Zabbix作为监控领域的佼佼者,时刻保持警惕是我们保障系统稳健的关键。然而,最近我们在一台闲置备库上收到这样一个告警:"Zabbix agent nodata more than 10min on oracle数据库db01"。
问题排查:
由于这台Oracle是备库,没有应用连接,所以平时负载也不高,排除了因为高负载导致zabbix无数据的告警。
再查看zabbix日志发现,原来是执行oracle监控脚本的时候超时导致了这个告警。
26791:20240312:134558.153 Failed to execute command "/usr/local/zabbix/share/externalscripts/oracle_monitor.sh gap": Timeout while executing a shell script.
手动执行脚本发现主要慢在MERGE JOIN CARTESIAN这一步 执行计划如下:
SQL Monitoring ReportSQL Text------------------------------select count(*) from gv$archive_gap where exists (select 1 from v$database where database_role ='PHYSICAL STANDBY')Global Information------------------------------Status : DONE (ALL ROWS)Instance ID : 1Session : SYS (576:56157)SQL ID : 9kph71w0hjrysSQL Execution ID : 17460592Execution Started : 03/12/2024 13:47:52First Refresh Time : 03/12/2024 13:47:58Last Refresh Time : 03/12/2024 13:48:15Duration : 23sModule/Action : sqlplus@db01 (TNS V1-V3)/-Service : SYS$USERSProgram : sqlplus@db01 (TNS V1-V3)Fetch Calls : 1Global Stats=======================================================| Elapsed | Cpu | Other | Fetch | Read | Read || Time(s) | Time(s) | Waits(s) | Calls | Reqs | Bytes |=======================================================| 23 | 23 | 0.00 | 1 | 131 | 18MB |=======================================================SQL Plan Monitoring Details (Plan Hash Value=180725793)=====================================================================================================================================================| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Mem | Activity | Activity Detail || | | | (Estim) | | Active(s) | Active | | (Actual) | (Max) | (%) | (# samples) |=====================================================================================================================================================| 0 | SELECT STATEMENT | | | | 1 | +23 | 1 | 1 | | | || 1 | SORT AGGREGATE | | 1 | | 1 | +23 | 1 | 1 | | | || 2 | FILTER | | | | | | 1 | | | | || 3 | FILTER | | | | | | 1 | | | | || 4 | HASH JOIN OUTER | | 5 | 4 | | | 1 | | 212K | | || 5 | HASH JOIN | | 5 | 3 | | | 1 | | 209K | | || 6 | VIEW | | 1 | 2 | | | 1 | | | | || 7 | FILTER | | | | | | 1 | | | | || 8 | HASH GROUP BY | | 1 | 2 | | | 1 | | | | || 9 | HASH JOIN | | 1 | 1 | 18 | +6 | 1 | 0 | 2M | | || 10 | FIXED TABLE FULL | X$KCCAL | 1 | | 1 | +6 | 1 | 14336 | | | || 11 | VIEW | | 1 | 1 | 1 | +23 | 1 | 1 | | | || 12 | HASH GROUP BY | | 1 | 1 | 23 | +1 | 1 | 1 | 4M | 60.87 | Cpu (14) || 13 | FILTER | | | | 18 | +6 | 1 | 87M | | 4.35 | Cpu (1) || 14 | HASH JOIN | | 1 | | 22 | +2 | 1 | 87M | 2M | 34.78 | Cpu (8) || 15 | FIXED TABLE FULL | X$KCCLH | 100 | | 1 | +6 | 1 | 9344 | | | || 16 | MERGE JOIN CARTESIAN | | 100 | | 18 | +6 | 1 | 9344 | | | || 17 | FIXED TABLE FULL | X$KCCIC | 1 | | 17 | +6 | 1 | 1 | | | || 18 | BUFFER SORT | | 100 | | 18 | +6 | 1 | 9344 | 628K | | || 19 | FIXED TABLE FULL | X$KCCLH | 100 | | 1 | +6 | 1 | 9344 | | | || 20 | VIEW | | 5 | 1 | | | | | | | || 21 | HASH GROUP BY | | 5 | 1 | | | | | | | || 22 | FIXED TABLE FULL | X$KCCLH | 5 | | | | | | | | || 23 | SORT AGGREGATE | | 1 | | | | | | | | || 24 | FIXED TABLE FULL | X$KCCFE | 1 | | | | | | | | || 25 | VIEW | | 1 | 1 | | | | | | | || 26 | HASH GROUP BY | | 1 | 1 | | | | | | | || 27 | FILTER | | | | | | | | | | || 28 | HASH JOIN | | 1 | | | | | | | | || 29 | NESTED LOOPS | | 1 | | | | | | | | || 30 | FIXED TABLE FULL | X$KCCFE | 1 | | | | | | | | || 31 | FIXED TABLE FULL | X$KCCLH | 1 | | | | | | | | || 32 | FIXED TABLE FULL | X$KCCIC | 1 | | | | | | | | || 33 | MERGE JOIN CARTESIAN | | 1 | | 1 | +6 | 1 | 1 | | | || 34 | FIXED TABLE FULL | X$KCCDI | 1 | | 1 | +6 | 1 | 1 | | | || 35 | BUFFER SORT | | 100 | | 1 | +6 | 1 | 1 | | | || 36 | FIXED TABLE FULL | X$KCCDI2 | 100 | | 1 | +6 | 1 | 1 | | | |=====================================================================================================================================================
为了解决性能问题,我们查阅了MOS文章,发现了下面这篇文章
Bug 18411339 - Low performance or ORA-1220 for query on V$ARCHIVE_GAP on 11.2.0.4 or later release (Doc ID 18411339.8)
Description
Low performance or ORA-1220 thrown for query on V$ARCHIVE_GAP on 11.2.0.4 or later release
due to execution plan on V$ARCHIVE_GAP using MERGE JOIN CARTESIAN internally involving
X$KCCLH, X$KCCLH, X$KCCAL.
解决办法1: 重构现有的监控脚本,用下面脚本监控dg gap
select USERENV('Instance'), high.thread#, low.lsq, high.hsqfrom(select a.thread#, rcvsq, min(a.sequence#)-1 hsqfrom v$archived_log a,(select lh.thread#, lh.resetlogs_change#, max(lh.sequence#) rcvsqfrom v$log_history lh, v$database_incarnation diwhere lh.resetlogs_time = di.resetlogs_timeand lh.resetlogs_change# = di.resetlogs_change#and di.status = 'CURRENT'and lh.thread# is not nulland lh.resetlogs_change# is not nulland lh.resetlogs_time is not nullgroup by lh.thread#, lh.resetlogs_change#) bwhere a.thread# = b.thread#and a.resetlogs_change# = b.resetlogs_change#and a.sequence# > rcvsqgroup by a.thread#, rcvsq) high,(select srl_lsq.thread#, nvl(lh_lsq.lsq, srl_lsq.lsq) lsqfrom(select thread#, min(sequence#)+1 lsqfromv$log_history lh, x$kccfe fe, v$database_incarnation diwhere to_number(fe.fecps) <= lh.next_change#and to_number(fe.fecps) >= lh.first_change#and fe.fedup!=0 and bitand(fe.festa, 12) = 12and di.resetlogs_time = lh.resetlogs_timeand lh.resetlogs_change# = di.resetlogs_change#and di.status = 'CURRENT'group by thread#) lh_lsq,(select thread#, max(sequence#)+1 lsqfromv$log_historywhere (select min( to_number(fe.fecps))from x$kccfe fewhere fe.fedup!=0 and bitand(fe.festa, 12) = 12)>= next_change#group by thread#) srl_lsqwhere srl_lsq.thread# = lh_lsq.thread#(+)) lowwhere low.thread# = high.thread#and lsq < = hsqand hsq > rcvsq;
解决办法2:从zabbix角度处理问题。
将zabbix-agent.conf的Timeout的值改为30
关注公众号,深入技术细节,共享数据库运维心得。
文章转载自数据库驾驶舱,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




