前几日一套11.2.0.4.7 on suse 11 2nodes2 RAC中节点2的oracle软件目录使用率高,发现mmnl后台进程生成了64GB的trace文件, 包含大量的ash dump信息, 下面简单的记录一下该问题。
Mmnl trace file
=========================
Note:
Trace文件中包含大旱的ASH DUMP信息,MMNL后台进程负责当前活动会话历史记录(ASH)进行采样,存储在循环SGA内存缓冲区中的信息将通过快照或紧急刷新刷新到AWR。
如果循环缓冲区已满三分之二,则ASH紧急冲洗将启动。如果由于某种原因无法将信息写入AWR,则ASH会将其内容(如果没有重新启动,则每12小时一次)转储到MMNL跟踪文件。
将信息写入跟踪而不是AWR的可能原因包括:
1, 首先确认当前的ASH 大小是否紧张?
如果因为ash buffer size小而紧急刷新,可以从db alert log确认,也可以使用下面的sql确认:
Note:
这里的紧急刷新(AWR_FLUSH_EMERGENCY_COUNT)次数不少,可以调大该参数(可在线调整),确认ash buffer从v$ash_info.total_size, 默认“_ash_size”并非当前值,但是修改该内存大小是修改"_ash_size"参数值,可以每次尝试增长50%观察使用情况。
Note:
如果调整的值小于当前值会提示ora-2097错误,但是在12C之前的版本超过了“_ash_size”的最大上限254Mb,同样会提示ora-2097错误。 但是从12c以后版本_ash_size值可以调整到大于254MB.
2. 检查MMON
既然写了这么多的ASH dump说明很多次是flush AWR失败的, 负责AWR Snapshot收集的是MMON进程, 然后检查MMON trace无日志文件。
Note:
该DB实例的MMON进程都不存在,且近段时间均为未生成AWR snapshot.
AWR没有生成SNAPSHOT的原因:
1,数据库配置问题和当前状态
2,数据库MMON后台进程问题
实例1的AWR正常,只有实例2的存在这现象,此时可以尝试手动生成AWR Snapshot,检查是否有报错或是某项收集时间长。
Note:
手动是可以生成,说明在AWR实际收集是没有问题的,只是调度进程出了问题, 当然如果手动收集时间都异常久可以配置AWR debug trace查看具体日志
等待AWR snapshot生成后,关闭TRACE对应_swrf_test_action值使用29和11, 同时还有awr_test dump event level(1,2,3,4)可以trace AWR的过程。该事件虽然是alter session, 但影响是实例级的。
对于本案例主要是因为MMON进程不存在,配置EVENT已没有意义, 下一步查看db alert log 查找MMON进程没有启动或停止的原因。
Note:
根据DB ALERT LOG可以发现该实例启动后,过了一段时间MMON进程因ORA-7445内部错误挂掉后就一直没有再启动,AWR SNAPSHOT也一直没有再生成。 MOS中有个相似的CASE。 Bug 22494766 : ORA-7445[KOCCNGT+74] ON MMON PROCESS AFTER ORA-25455
解决方法
1,重启实例
或者
2. 临时启动并禁用限制模式
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
因为这是一个新交维的边缘库,平时没有什么新连接和活动,但是重启实例目前是不可以的。 启用限制模式只影响新连接请求,已有连接无影响,我们采用方案2.
-- db alert log
后期确认数据库实例已经可以正常收集AWR,MMON记启动,同时MMNL进程已重启,(KILL两进程对实例可用性无影响,该进程可以自动重启)。MMNL TRACE没有再生成那么大量的日志。
-- enjoy it --
Mmnl trace file
=========================
*** 2018-08-22 20:33:13.512
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(74) (!= 73) at 22-AUG-18 08.53.28.796 PM
*** 2018-08-22 20:53:28.796
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(75) (!= 74) at 22-AUG-18 09.03.37.184 PM
*** 2018-08-22 21:03:37.184
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(76) (!= 75) at 22-AUG-18 09.13.36.996 PM
*** 2018-08-22 21:13:36.996
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
*** 2018-08-22 21:23:37.692
kewa_sampler: Re-signalling MMON(77) (!= 76) at 22-AUG-18 09.23.37.692 PM
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(78) (!= 77) at 23-AUG-18 12.23.40.741 AM
*** 2018-08-23 00:23:40.741
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch
*** 2018-08-23 01:02:31.559
kgskmetricupd: could not get change-schema latch
*** 2018-08-23 01:04:30.886
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(79) (!= 78) at 23-AUG-18 01.23.57.348 AM
*** 2018-08-23 01:23:57.349
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
ASH dump
*** 2018-08-23 01:38:45.614
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>>
****************
SCRIPT TO IMPORT
****************
------------------------------------------
Step 1: Create destination table <ashdump>
------------------------------------------
CREATE TABLE ashdump AS
SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0
----------------------------------------------------------------
Step 2: Create the SQL*Loader control file <ashldr.ctl> as below
----------------------------------------------------------------
load data
infile * "str '\
####\
'"
append
into table ashdump
fields terminated by ',' optionally enclosed by '"'
(
SNAP_ID CONSTANT 0 ,
DBID ,
INSTANCE_NUMBER ,
...
...
...
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP END>>>
Note:
Trace文件中包含大旱的ASH DUMP信息,MMNL后台进程负责当前活动会话历史记录(ASH)进行采样,存储在循环SGA内存缓冲区中的信息将通过快照或紧急刷新刷新到AWR。
如果循环缓冲区已满三分之二,则ASH紧急冲洗将启动。如果由于某种原因无法将信息写入AWR,则ASH会将其内容(如果没有重新启动,则每12小时一次)转储到MMNL跟踪文件。
将信息写入跟踪而不是AWR的可能原因包括:
ASH缓冲区太小<===首先检查ASH缓冲区是否太小
MMON进程没有响应<====然后,检查MMON进程没有响应的原因
1, 首先确认当前的ASH 大小是否紧张?
如果因为ash buffer size小而紧急刷新,可以从db alert log确认,也可以使用下面的sql确认:
SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;
TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
134217728 20529 6828
Note:
这里的紧急刷新(AWR_FLUSH_EMERGENCY_COUNT)次数不少,可以调大该参数(可在线调整),确认ash buffer从v$ash_info.total_size, 默认“_ash_size”并非当前值,但是修改该内存大小是修改"_ash_size"参数值,可以每次尝试增长50%观察使用情况。
SQL> alter system set "_ash_size"=100m;
alter system set "_ash_size"=100m
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-02097: parameter cannot be modified because specified value is invalid
SQL> @pd ash_size
Show all parameters and session values from x$ksppi/x$ksppcv...
INDX I_HEX NAME VALUE DESCRIPTION
---------- ----- -------------------------------------------------- ------------------------------ ----------------------------------------------------------------------
2647 A57 _olap_dimension_corehash_size 30 OLAP Dimension In-Core Hash Table Maximum Memory Use
2731 AAB _kffmap_hash_size 1024 size of kffmap_hash table
2732 AAC _kffmop_hash_size 2048 size of kffmop_hash table
2840 B18 _ash_size 1048618 To set the size of the in-memory Active Session History buffers
SQL> alter system set "_ash_size"=200m;
System altered.
SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;
TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
209715200 20531 6829
Note:
如果调整的值小于当前值会提示ora-2097错误,但是在12C之前的版本超过了“_ash_size”的最大上限254Mb,同样会提示ora-2097错误。 但是从12c以后版本_ash_size值可以调整到大于254MB.
2. 检查MMON
既然写了这么多的ASH dump说明很多次是flush AWR失败的, 负责AWR Snapshot收集的是MMON进程, 然后检查MMON trace无日志文件。
oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmon
grid 8576 1 0 2016 ? 03:29:29 asm_mmon_+ASM2
oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmnl
grid 8578 1 0 2016 ? 2-17:07:26 asm_mmnl_+ASM2
oracle 9176 1 1 2016 ? 11-20:41:46 ora_mmnl_ANBOB2
SQL> col systimestamp form a35
SQL> col most_recent_snap_time form a25
SQL> col snap_interval form a17
SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid from v$database);
SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL
----------------------------------- ------------------------- -----------------
24-AUG-18 09.36.27.638363 AM +08:00 24-AUG-18 09.00.33.122 AM +00000 01:00:00.0
SQL> select thread#,open_time from v$thread;
THREAD# OPEN_TIME
---------- -----------------
1 20161225 17:46:00
2 20160226 14:31:03
SQL> select open_mode from v$database;
OPEN_MODE
--------------------
READ WRITE
SQL> col instance_number form 999 head INST
SQL> col begin_interval_time form a25
SQL> col flush_elapsed form a17
SQL> col status form 999
SQL> col error_count form 999 head ERR
SQL> col snap_flag form 999 head SNAP
SQL> select * from
(select snap_id,
instance_number,
begin_interval_time,
flush_elapsed,
status,
error_count,
snap_flag
from wrm$_snapshot
where dbid = (select dbid from v$database)
and instance_number=2 --# instance 2
order by snap_id desc)
where rownum <= 10
order by snap_id ;
-- none --
Note:
该DB实例的MMON进程都不存在,且近段时间均为未生成AWR snapshot.
AWR没有生成SNAPSHOT的原因:
1,数据库配置问题和当前状态
1.1 实例参数STATISTICS_LEVEL
1.2 数据库没打开
1.3 数据库打开没有在READ WRITE模式
1.4 配置的间隔时间问题
2,数据库MMON后台进程问题
2.1 MMON进程未启动
2.2 MMON进程挂起
2.3 MMON在收集某项时时间异常久,没有完成
2.4 SYSAUX表空间可用空间紧张
实例1的AWR正常,只有实例2的存在这现象,此时可以尝试手动生成AWR Snapshot,检查是否有报错或是某项收集时间长。
SQL> exec dbms_workload_repository.create_snapshot;
SQL> exec dbms_workload_repository.create_snapshot;
SQL>
SQL> select * from
2 (select snap_id,
3 instance_number,
4 begin_interval_time,
5 flush_elapsed,
6 status,
7 error_count,
8 snap_flag
9 from wrm$_snapshot
10 where dbid = (select dbid from v$database)
11 and instance_number=2 --# instance 2
12 order by snap_id desc)
13 where rownum <= 10
14 order by snap_id ;
Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1
22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1
Note:
手动是可以生成,说明在AWR实际收集是没有问题的,只是调度进程出了问题, 当然如果手动收集时间都异常久可以配置AWR debug trace查看具体日志
Gathering MMON action trace and snapshot flush trace:
SQL> alter session set "_swrf_test_action" = 28;
Session altered.
SQL> alter session set "_swrf_test_action" = 10;
Session altered.
等待AWR snapshot生成后,关闭TRACE对应_swrf_test_action值使用29和11, 同时还有awr_test dump event level(1,2,3,4)可以trace AWR的过程。该事件虽然是alter session, 但影响是实例级的。
对于本案例主要是因为MMON进程不存在,配置EVENT已没有意义, 下一步查看db alert log 查找MMON进程没有启动或停止的原因。
# alert log
Sat Sep 16 05:56:38 2017
VKRM started with pid=87, OS id=44418
Sat Sep 16 07:24:10 2017
Thread 2 advanced to log sequence 3377 (LGWR switch)
Current log# 3 seq# 3377 mem# 0: +DATADG/yygk/redo03.log
Sat Sep 16 09:14:04 2017
ORA-25455 encountered when generating server alert SMG-3500
Sat Sep 16 09:15:04 2017
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Errors in file /oracle/app/oracle/diag/rdbms/yygk/yygk2/trace/yygk2_mmon_9174.trc (incident=288273):
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []
Incident details in: /oracle/app/oracle/diag/rdbms/yygk/yygk2/incident/incdir_288273/yygk2_mmon_9174_i288273.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Sep 16 09:15:06 2017
Dumping diagnostic data in directory=[cdmp_20170916091506], requested by (instance=2, osid=9174 (MMON)), summary=[incident=288273].
Sat Sep 16 14:58:12 2017
Thread 2 advanced to log sequence 3378 (LGWR switch)
Current log# 4 seq# 3378 mem# 0: +DATADG/yygk/redo04.log
Sun Sep 17 01:56:39 2017
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Sun Sep 17 05:56:39 2017
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sun Sep 17 05:56:39 2017
Starting background process VKRM
Sun Sep 17 05:56:39 2017
oracle@anbob-2:/home/oracle>vi /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Dump file /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/db
System name: Linux
Node name: anbob-2
Release: 3.0.76-0.11-default
Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine: x86_64
Instance name: anbob2
Redo thread mounted by this instance: 2
Oracle process number: 34
Unix process pid: 9174, image: oracle@anbob-2 (MMON)
*** 2017-09-16 09:15:05.098
*** SESSION ID:(4795.1) 2017-09-16 09:15:05.098
*** CLIENT ID:() 2017-09-16 09:15:05.098
*** SERVICE NAME:(SYS$BACKGROUND) 2017-09-16 09:15:05.098
*** MODULE NAME:() 2017-09-16 09:15:05.098
*** ACTION NAME:() 2017-09-16 09:15:05.098
Dump continued from file: /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_mmon_9174.trc
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []
========= Dump for incident 288273 (ORA 7445 [kocdsgt()+368]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Registers:
%rax: 0x00007f14f45e3640 %rbx: 0x0000000000000000 %rcx: 0x0000000000000008
%rdx: 0x0140000700000100 %rdi: 0x00007f14f4325568 %rsi: 0x0000000000000000
%rsp: 0x00007fffbec8d8c0 %rbp: 0x00007fffbec8db60 %r8: 0x0000000000000000
%r9: 0x00007f14f4325d70 %r10: 0x00007f14f4326570 %r11: 0x0000000000000000
%r12: 0x00007f14f4344e20 %r13: 0x0000000000002010 %r14: 0x0000000000000000
%r15: 0x00007fffbec8dca8 %rip: 0x0000000009929640 %efl: 0x0000000000010203
kocdsgt()+352 (0x9929630) add $-48,%rdx
kocdsgt()+356 (0x9929634) mov %rdx,-0x40(%rbp)
kocdsgt()+360 (0x9929638) mov -0x40(%rbp),%rax
kocdsgt()+364 (0x992963c) mov 0x8(%rax),%rdx
> kocdsgt()+368 (0x9929640) movzwl (%rdx),%ebx
kocdsgt()+371 (0x9929643) mov -0x38(%rbp),%rcx
kocdsgt()+375 (0x9929647) movzwl 0xc(%rcx),%esi
kocdsgt()+379 (0x992964b) cmp %esi,%ebx
kocdsgt()+381 (0x992964d) jnz 0x992982b
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
location type
-------------------- ------
skdstdst()+41 < ksedst1()+103 < ksedst()+39 < dbkedDefDump()+2746 <
ksedmp()+41 < ssexhd()+2462 < __sighandler() < kocdsgt()+368 signal
kocioc()+648 < kokdn2p()+520 < kodin2p()+176 < kocapm()+1051 < koioapm()+525 <
OCITypeByName()+816 < keltgtyp()+75 < kpotcuexec()+254 < rpiswu2()+1776 <
OCIKSwitchUserWithPriv()+200 < keltenq()+583 <
keltpost()+1812 < kelr_issue_alert()309 kelr_monitor_metrics()+3631 <
ksb_run_managed_action()+384 < ksbcti()+2490 < ksbabs()+1735 <
kebm_mmon_main()+209 < ksbrdp()+1045 < opirip()+623 < opidrv()+603 <
sou2o()+103 < opimai_real()+250 < ssthrdmain()+265 <
main()+201
Note:
根据DB ALERT LOG可以发现该实例启动后,过了一段时间MMON进程因ORA-7445内部错误挂掉后就一直没有再启动,AWR SNAPSHOT也一直没有再生成。 MOS中有个相似的CASE。 Bug 22494766 : ORA-7445[KOCCNGT+74] ON MMON PROCESS AFTER ORA-25455
解决方法
1,重启实例
或者
2. 临时启动并禁用限制模式
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
因为这是一个新交维的边缘库,平时没有什么新连接和活动,但是重启实例目前是不可以的。 启用限制模式只影响新连接请求,已有连接无影响,我们采用方案2.
-- db alert log
2018-08-24 09:01:23.236000 +08:00
Thread 2 advanced to log sequence 6820 (LGWR switch)
Current log# 4 seq# 6820 mem# 0: +DATADG/anbob/redo04.log
2018-08-24 10:25:59.102000 +08:00
Stopping background process MMNL
2018-08-24 10:26:00.101000 +08:00
Starting background process MMON
MMON started with pid=35, OS id=76908
Starting background process MMNL
MMNL started with pid=120, OS id=76910
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
2018-08-24 10:26:01.811000 +08:00
minact-scn: Inst 2 is a slave inc#:40 mmon proc-id:76908 status:0x2
minact-scn status: grec-scn:0x0ccf.d5bb5cdf gmin-scn:0x0ccf.d5b7d696 gcalc-scn:0x0ccf.d5bb3a01
Some DDE async actions failed or were cancelled
oracle@anbob-2:/home/oracle>ps -ef|grep ora_m|grep -v grep
oracle 9144 1 0 2016 ? 02:07:06 ora_mman_anbob2
oracle 9184 1 0 2016 ? 01:55:18 ora_mark_anbob2
oracle 76908 1 0 10:25 ? 00:01:35 ora_mmon_anbob2
oracle 76910 1 1 10:25 ? 00:05:52 ora_mmnl_anbob2
SQL> select * from
2 (select snap_id,
3 instance_number,
4 begin_interval_time,
5 flush_elapsed,
6 status,
7 error_count,
8 snap_flag
9 from wrm$_snapshot
10 where dbid = (select dbid from v$database)
11 and instance_number=2 --# instance 2
12 order by snap_id desc)
13 where rownum <= 10 14 order by snap_id ;
Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1
22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1
SQL> /
Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1
22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1
22767 2 24-AUG-18 09.52.24.661 AM +00000 00:00:05.1 0 0 0
22768 2 24-AUG-18 11.00.09.946 AM +00000 00:00:04.0 0 0 0
22769 2 24-AUG-18 12.00.22.008 PM +00000 00:00:03.2 0 0 0
22770 2 24-AUG-18 01.00.33.819 PM +00000 00:00:05.1 0 0 0
22771 2 24-AUG-18 02.00.45.857 PM +00000 00:00:04.0 0 0 0
22772 2 24-AUG-18 03.00.57.960 PM +00000 00:00:04.0 0 0 0
22773 2 24-AUG-18 04.00.09.952 PM +00000 00:00:03.9 0 0 0
22774 2 24-AUG-18 05.00.22.065 PM +00000 00:00:04.1 0 0 0
后期确认数据库实例已经可以正常收集AWR,MMON记启动,同时MMNL进程已重启,(KILL两进程对实例可用性无影响,该进程可以自动重启)。MMNL TRACE没有再生成那么大量的日志。
-- enjoy it --
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




