问题描述
一朋友公司的OA系统挂了(泛微技术支持说是神马神马),友情帮忙分析一下。 如下是alert log信息:
Mon Jun 23 11:28:53 2014 WARNING: inbound connection timed out (ORA-3136) Mon Jun 23 22:00:06 2014 Thread 1 advanced to log sequence 339 (LGWR switch) Current log# 3 seq# 339 mem# 0: /oradata/redo03.log Current log# 3 seq# 339 mem# 1: /oracle/onlinelog/redo03_b.log Wed Jun 25 15:43:34 2014 Thread 1 advanced to log sequence 340 (LGWR switch) Current log# 4 seq# 340 mem# 0: /oradata/redo04.log Current log# 4 seq# 340 mem# 1: /oracle/onlinelog/redo04_b.log Fri Jun 27 15:50:26 2014 Thread 1 advanced to log sequence 341 (LGWR switch) Current log# 1 seq# 341 mem# 0: /oradata/redo01.log Current log# 1 seq# 341 mem# 1: /oracle/onlinelog/redo01_b.log Mon Jun 30 08:44:40 2014 ksvcreate: Process(m000) creation failed Mon Jun 30 08:46:21 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:47:24 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:48:26 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:49:26 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:50:27 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:51:27 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:52:28 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:53:28 2014 PMON failed to acquire latch, see PMON dump Mon Jun 30 08:53:43 2014 Errors in file /oracle/admin/orcloa/bdump/orcloa_pmon_16065.trc: ORA-00474: SMON process terminated with error Mon Jun 30 08:53:52 2014 Errors in file /oracle/admin/orcloa/bdump/orcloa_pmon_16065.trc: ORA-00474: SMON process terminated with error Mon Jun 30 08:53:57 2014 MMAN: terminating instance due to error 472 Instance terminated by MMAN, pid = 16073 Mon Jun 30 08:54:22 2014 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 3 Autotune of undo retention is turned on. IMODE=BR ILAT =194 LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.4.0. System parameters with non-default values:
专家解答
从日志来看,我们可以看到,6月30号08:44:40时出现M000进程无法创建进程的error信息,接着08:46:21开始,
Pmon进程开始报错无法获得latch,最后到08:53:52时间点smon进程出现异常,最后08:53:57时,MMAN进程终止
了数据库实例。
从上面的信息来看,MMAN进程最后强行终止了实例。对于MMAN进程,大家知道它是干什么的么?
首先我们需要了解一下,Oracle数据库中的一些核心进程,如果是单实例环境,那么下面是一些核心进程:
pmon、smon、dbwr、lgwr、reco、ckpt、mman(10g引入)、PSP0. 这些进程中的任何一个进程出现异常都会导致数据库crash。
如果是RAC环境,那么还有lmon,lmd,lck等核心进程。
前面的信息可以看出,smon进程出现了异常,在smon进程报错之前,pmon一直在报错无法获得latch。 下面我们来看下
pmon 进程的trace内容:
PMON unable to acquire latch 600eca50 Child shared pool level=7 child#=1 Location from where latch is held: kghalo: Context saved from call: 0 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 942 (107, 1404089145, 52) 880 (107, 1404089145, 107) 883 (107, 1404089145, 107) 879 (107, 1404089145, 107) 780 (107, 1404089145, 52) 958 (107, 1404089145, 107) 884 (107, 1404089145, 107) 776 (107, 1404089145, 52) 887 (107, 1404089145, 107) 778 (107, 1404089145, 52) 889 (107, 1404089145, 107) 890 (107, 1404089145, 107) 892 (107, 1404089145, 107) 886 (107, 1404089145, 107) 773 (107, 1404089145, 49) 894 (107, 1404089145, 107) 896 (107, 1404089145, 107) 949 (107, 1404089145, 49) 897 (107, 1404089145, 107) 902 (107, 1404089145, 107) 957 (101, 1404089145, 101) 946 (101, 1404089145, 49) 960 (101, 1404089145, 101) 11 (101, 1404089145, 101) 962 (101, 1404089145, 101) 876 (101, 1404089145, 101) 900 (101, 1404089145, 101) 940 (86, 1404089145, 52) 30 (76, 1404089145, 76) 12 (64, 1404089145, 64) 48 (46, 1404089145, 46) waiter count=31 gotten 632636269 times wait, failed first 7151612 sleeps 1177174 gotten 0 times nowait, failed: 0 possible holder pid = 944 ospid=4567 ----------------------------------------
可以看到Pmon进程无法获得shared pool Latch,该latch的地址为:600eca50.且该该地址正被pid=944所持有没有释放,如下:
SO: 0x37ea33630, type: 2, owner: (nil), flag: INIT/-/-/0x00 (process) Oracle pid=944, calls cur/top: (nil)/0x179a01278, flag: (0) - int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 0 last post received-location: No post last process to post me: none last post sent: 0 0 0 last post sent-location: No post last process posted by me: none (latch info) wait_event=0 bits=280 holding (efd=6) 3721e0170 Child library cache level=5 child#=9 Location from where latch is held: kghfrunp: clatch: nowait: Context saved from call: 0 state=busy, wlstate=free holding (efd=6) 600eca50 Child shared pool level=7 child#=1 Location from where latch is held: kghalo: Context saved from call: 0 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 942 (107, 1404089145, 52) 880 (107, 1404089145, 107) 883 (107, 1404089145, 107) 879 (107, 1404089145, 107) 780 (107, 1404089145, 52) 958 (107, 1404089145, 107) 884 (107, 1404089145, 107) 776 (107, 1404089145, 52) 887 (107, 1404089145, 107) 778 (107, 1404089145, 52) 889 (107, 1404089145, 107) 890 (107, 1404089145, 107) 892 (107, 1404089145, 107) 886 (107, 1404089145, 107) 773 (107, 1404089145, 49) 894 (107, 1404089145, 107) 896 (107, 1404089145, 107) 949 (107, 1404089145, 49) 897 (107, 1404089145, 107) 902 (107, 1404089145, 107) 957 (101, 1404089145, 101) 946 (101, 1404089145, 49) 960 (101, 1404089145, 101) 11 (101, 1404089145, 101) 962 (101, 1404089145, 101) 876 (101, 1404089145, 101) 900 (101, 1404089145, 101) 940 (86, 1404089145, 52) 30 (76, 1404089145, 76) 12 (64, 1404089145, 64) 48 (46, 1404089145, 46) waiter count=31 Process Group: DEFAULT, pseudo proc: 0x37bae40f0 O/S info: user: oracle, term: UNKNOWN, ospid: 4567 OSD pid info: Unix process pid: 4567, image: oracle@localhost.localdomain
可以看到该进程不但持有了shared pool latch(600eca50),同时还持有library cache latch(3721e0170).
一个进程同时持有2个latch,且从这2个latch来看,显然跟shared pool有关系,我们不难判断可能是内存出现异常了。
同时我们分析日志还发现了如下信息:
*** 2014-06-30 08:53:42.242 Background process SMON found dead Oracle pid = 7 OS pid (from detached process) = 16087 OS pid (from process state) = 16087 dtp = 0x6001ed98, proc = 0x37b957198
可以看到smon进程发现了死进程16087,这里信息不足,不知道该进程是什么。结合前面的总总分析不难判断应该是
跟内存有极大关联。同时最开始的alert log中也报错m000无法创建,该进程是mmon的slave进程。下面我们来分析
mmon进程的trace,发现了如下内容:
*** SESSION ID:(1756.3) 2014-06-30 04:01:32.971 Waited for process m000 to initialize for 60 seconds *** 2014-06-30 04:01:33.108 Dumping diagnostic information for m000: OS pid = 29866 loadavg : 0.66 0.19 0.06 memory info: free memory = 0.00M swap info: free = 0.00M alloc = 0.00M total = 0.00M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 D oracle 29866 1 3 78 0 - 3182938 sync_p 04:00 ? 00:00:02 ora_m000_orcloa /bin/sh: /usr/bin/gdb: 没有那个文件或目录 *** 2014-06-30 04:01:33.254 *** 2014-06-30 08:42:47.940 Waited for process m000 to initialize for 60 seconds *** 2014-06-30 08:42:48.605 Dumping diagnostic information for m000: OS pid = 3993 loadavg : 14.13 6.64 2.77 memory info: free memory = 0.00M swap info: free = 0.00M alloc = 0.00M total = 0.00M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 S oracle 3993 1 0 75 0 - 3182936 semtim 08:41 ? 00:00:00 ora_m000_orcloa /bin/sh: /usr/bin/gdb: 没有那个文件或目录 *** 2014-06-30 08:43:06.469 *** 2014-06-30 08:43:20.220 Waited for process m000 to initialize for 70 seconds *** 2014-06-30 08:43:20.220 Dumping diagnostic information for m000: OS pid = 3993 loadavg : 17.64 8.20 3.41 memory info: free memory = 0.00M swap info: free = 0.00M alloc = 0.00M total = 0.00M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 S oracle 3993 1 0 75 0 - 3182936 semtim 08:41 ? 00:00:00 ora_m000_orcloa /bin/sh: /usr/bin/gdb: 没有那个文件或目录 *** 2014-06-30 08:43:21.609 *** 2014-06-30 08:43:32.348
可以看到free memory已经为0了,这里较为遗憾的他这里的环境变量配置可能不对,导致调用gdb失败,否则trace
文件中应该还会存在该进程的call stack等信息,不过这里不足以影响我们整个分析过程。
不知道大家发现没有,比较怪异的是这里居然没有使用swap? 而且total也是0M,那么到没有配置swap吗?
据了解,该主机的物理内存为16GB,而SGA分配了12G,PGA分配了2GB。这明显设置有点偏大了。
最后综合前面的信息,我们可能不难判断他这里数据库之所以crash是因为memory资源不足导致的。
分析到这里就完了吗? 可能很多人跟我一样存在如下的几个疑问:
1) 进程为什么持有shared pool latch不释放?
2) 进程什么情况需要去申请获得shared pool latch?
3) shared pool latch只有一个吗?为什么导致争用?
4) 如果shared pool latch存在多个children latch,那么是否可以避免这个问题?
要回答上面这几个问题,我们需要通过实验来研究一下。首先来看下我的Linux 5.4+oracle 10.2.0.5 环境:
ADDR LATCH# LEVEL# NAME -------- ---------- ---------- -------------------------------------------------- 20010778 216 7 shared pool SQL> select addr,LATCH#,LEVEL#,CHILD#,NAME from v$latch_children where name='shared pool'; ADDR LATCH# LEVEL# CHILD# NAME -------- ---------- ---------- ---------- -------------------------------------------------- 200999EC 216 7 1 shared pool 20099A50 216 7 2 shared pool 20099AB4 216 7 3 shared pool 20099B18 216 7 4 shared pool 20099B7C 216 7 5 shared pool 20099BE0 216 7 6 shared pool 20099C44 216 7 7 shared pool 7 rows selected.
可以看到shared pool latch最多可以有7个children latch可用。下面我们通过实验来探讨一下(这里准备了多个session)。
++++Session 1 SQL> select sid from v$mystat where rownum <2; SID ---------- 544 SQL> select s.sid,s.serial#,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=544; SID SERIAL# SPID ---------- ---------- ------------ 544 3 21519 SQL> oradebug setmypid Statement processed. SQL> oradebug poke 0x200999EC 4 1 BEFORE: [200999EC, 200999F0) = 00000000 AFTER: [200999EC, 200999F0) = 00000001 SQL> ++++Session 2 SQL> show parameter kghd NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ _kghdsidx_count integer 1 SQL> select sid from v$mystat where rownum < 2; SID ---------- 525 SQL> select s.sid,s.serial#,p.spid from v$process p,v$session s where p.addr=s.paddr and s.sid=525; SID SERIAL# SPID ---------- ---------- ------------ 525 4 21619
当Session 1 poke 之后,Session 2执行如下查询:
SQL> select count(1) from test1; ......Session hang
正如所料,该查询hang了.
在事先已经连接上的Session 3中进程dump,来观察session 2 hang住之后的情况是什么样的:
++++Session 3 SQL> oradebug setospid 21619 Oracle pid: 26, Unix process pid: 21619, image: oracle@killdb.com (TNS V1-V3) SQL> oradebug dump processstate 10 Statement processed. SQL> oradebug close_trace Statement processed. SQL> oradebug tracefile_name /home/ora10g/admin/roger/udump/roger_ora_21619.trc SQL>
roger_ora_21619.trc 文件的部分信息如下:
Process global information: process: 0x5219498c, call: 0x51d55734, xact: (nil), curses: 0x51d05ef0, usrses: 0x51d05ef0 ---------------------------------------- SO: 0x5219498c, type: 2, owner: (nil), flag: INIT/-/-/0x00 (process) Oracle pid=26, calls cur/top: 0x51d55734/0x51d55734, flag: (0) - int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 0 last post received-location: No post last process to post me: none last post sent: 0 0 33 last post sent-location: ksrpublish last process posted by me: 5218e1c4 1 22 (latch info) wait_event=0 bits=20 Location from where call was made: kghupr1: Chunk Header Context saved from call: 1318640688 waiting for 200999ec Child shared pool level=7 child#=1 +++++正在等待shared pool Latch Location from where latch is held: kghupr1: Chunk Header Context saved from call: 1310882692 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 10 (21, 1404368123, 21) 26 (15, 1404368123, 15) waiter count=2 gotten 42222 times wait, failed first 5 sleeps 5 gotten 0 times nowait, failed: 0 on wait list for 200999ec holding (efd=3) 4fd3dd80 Child library cache level=5 child#=1 +++++同时该进程还持有一个library cache latch Location from where latch is held: kgllkdl: child: no lock handle: latch Context saved from call: 0 state=busy, wlstate=free Process Group: DEFAULT, pseudo proc: 0x522410a0 O/S info: user: ora10g, term: pts/11, ospid: 21619 OSD pid info: Unix process pid: 21619, image: oracle@killdb.com (TNS V1-V3)
从前面的测试可以看出,当_kghdsidx_count参数为1时,shared pool是没有subpool的,其只有一个shared pool latch.
虽然从v$latch_children查到有7条记录。 也正是因为这样,当我一个进程持有shared pool latch后不释放,那么其他进程
都进行任何查询都将会hang住,因为必须等待shared pool latch.
我们先将latch释放:
SQL> oradebug poke 0x200999EC 4 0 BEFORE: [200999EC, 200999F0) = 000000FF AFTER: [200999EC, 200999F0) = 00000000 SQL>
当latch释放之后,session 2立刻返回结果:
SQL> select count(1) from test1; COUNT(1) ---------- 377
到这里,或许很多人都跟我之前的想法一样?之所以会等待,是由于只要一个shared pool latch可用。那么如果存在多个呢?
即有children latch可用的情况下,其他session 还会hang住吗? 我们继续来看下面的实验:
++++Session 1
SQL> alter system set "_kghdsidx_count"=3 scope=spfile; System altered. SQL> shutdown immediate Database closed. Database dismounted. ORACLE instance shut down. SQL> startup ORACLE instance started. Total System Global Area 838860800 bytes Fixed Size 1276476 bytes Variable Size 629147076 bytes Database Buffers 205520896 bytes Redo Buffers 2916352 bytes Database mounted. Database opened. SQL> conn roger/roger Connected. SQL> show parameter_kghdsidx_count NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ _kghdsidx_count integer 3 SQL> conn /as sysdba Connected. SQL> select addr,LATCH#,LEVEL#,CHILD#,NAME from v$latch_children where name='shared pool'; ADDR LATCH# LEVEL# CHILD# NAME -------- ---------- ---------- ---------- -------------------------------------------------- 200999EC 216 7 1 shared pool 20099A50 216 7 2 shared pool 20099AB4 216 7 3 shared pool 20099B18 216 7 4 shared pool 20099B7C 216 7 5 shared pool 20099BE0 216 7 6 shared pool 20099C44 216 7 7 shared pool 7 rows selected.
这里前面的3个child latch应该是可用的. 我测试过程中发现即持有其中任何一个latch,对于新登陆的SQLPLUS会话都会hang住。
接着还测试了一下硬解析,软解析和软软解析的情况:
---Session 1 SQL> show user USER is "SYS" SQL> oradebug poke 0x20099A50 4 1 BEFORE: [20099A50, 20099A54) = 00000000 AFTER: [20099A50, 20099A54) = 00000001 SQL> ---Session 2 SQL> select count(*) from test1; ......hang ---Session 1 SQL> oradebug poke 0x20099A50 4 1 BEFORE: [20099A50, 20099A54) = 00000000 AFTER: [20099A50, 20099A54) = 00000001 SQL> ---Session 2 SQL> select count(*) from test1; .....hang
可以发现,再次执行该SQL,仍然hang,也就是说软解析,仍然会hang.
当Session 2会话多次执行该SQL之后(我们知道执行超过3次,其对应的cursor就会被cache),如下:
---Session 2 SQL> select count(*) from test1 a; COUNT(*) ---------- 377 SQL> select count(*) from test1 a; COUNT(*) ---------- 377 SQL> select count(*) from test1 a; COUNT(*) ---------- 377
对于软软解析的测试结果是这样的;
---Session 1持有latch SQL> oradebug poke 0x20099A50 4 1 BEFORE: [20099A50, 20099A54) = 00000000 AFTER: [20099A50, 20099A54) = 00000001 SQL> ---Session 2 SQL> select count(*) from test1 a; COUNT(*) ---------- 377 --可以正常执行 如果下面这样,将会hang: SQL> select count(*) from test1 a; .....session hang 如果是这样的方式是没有问题的: SQL> / COUNT(*) ---------- 377
有点让我匪夷所思,不知道大家看了是什么感慨?
所以我认为对于硬解析和阮解析都是需要获得shared pool latch的,另外对于需要操作shared pool那么都需要获得(这里主要是memory消耗或释放).
因此这也不难理解为什么Oracle PMON进程需要获得shared pool latch了。
那么为什么进程会持有latch不释放呢? 我们知道latch的申请或释放正常情况下都是极快的。一般来讲无非有如下情况:
1)进程本身hang住了
2)进程dead
3)Oracle bug
另外,即使shared pool有多个child latch可用,也会导致争用等待。 只是这里还有一个问题,我始终未能想明白,那就是:
假设现在存在3个可用的child latch,假设其中一个被持有,还有2个可用;如果此时有会话A进行操作,那么会话A会怎么样?
从目前的测试来看,会话A一定会hang住。 跟老熊请教了一下,他的猜测是:Oracle这里会根据进程号来进行hash运算得到一个值,
来确认使用哪个child latch,而不是一定会固定去申请使用某个子latch。
对于这个算法,我也比较赞同,之所以赞同是因为library cache中针对SQL文本进行hash运算,得到bucket编号。 另外oracle
cache buffer的算法也是类似:
假设file 1 block 437,那么计算方式应该是这样:
SQL> select mod(438,4) from dual;
MOD(438,4)
———-
2
即使该block的信息会存在编号为2的bucket里面。
这是一个大胆的猜测,从我目前的测试来看,似乎有点站不住脚。这一点我还需要进一步测试验证。