最近有朋友在 墨天轮 提出一个问题,数据库遇到了 ORA-600 15711 错误。
这个错误是比较少见的,MOS 上相关的BUG只有一个:
Bug 3212516 Select from GV$ views can fail with OERI[15711]
这个BUG是和 RAC 相关的,在查询 GV$ 视图时发生,这个BUG的描述是:
Parallel operations across nodes on GV$ views can fail with
ORA-600 [15711] particularly if the other node/s are changing
state (up/down/up)
在跨节点查询 GV$ 视图使用并行操作时出现,通常这时候对方节点处于异常状态,例如正在启动或者关闭。
这个BUG对应的数据库版本是 9.2.0.7,在那之后,就再也没有确认的BUG了。
那么我们看看这个新提出来的问题,系统环境是 Oracle 11.2.0.3 的 AIX 平台 RAC 集群版本:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/product/11gr2
System name: AIX
在跟踪文件的头部展示了错误发生的应用模块,这个错误是由于 OGG 进行数据抽取引起的:
*** MODULE NAME:(OGG-EXTJC1-OCI_META_THREAD) 2019-09-02 21:22:21.965
*** ACTION NAME:() 2019-09-02 21:22:21.965
而错误抛出的SQL同样是查询 GV$instance 视图:
----- Current SQL Statement for this session (sql_id=48k7nky3dkt0d) -----
SELECT TO_CHAR(startup_time, 'YYYY-MM-DD HH24:MI:SS') FROM gv$instance WHERE inst_id = 2
所以这个问题和Bug 3212516是同样的问题,说明错误发生在执行 GV$ 跨实例查询时,对方节点的状态存在问题。
具体来看,这个问题的错误号如下:
Dump continued from file: ora_9044270.trc
ORA-00600: internal error code, arguments: [15711], [4], [0x700000C38298638], [0x700000B9F93DBD0], [], [], [], [], [], [], [], []
其错误堆栈如下:
----- Current SQL Statement for this session (sql_id=48k7nky3dkt0d) -----
SELECT TO_CHAR(startup_time, 'YYYY-MM-DD HH24:MI:SS') FROM gv$instance WHERE inst_id = 2
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+40 bl 107c691e8 000000000 ? 000000001 ?
000000003 ? 000000000 ?
000000000 ? 000000001 ?
000000003 ? 000000000 ?
ksedst1()+112 call skdstdst() 15C0A1C5F9AB5ACB ?
48442841312F7377 ?
FFFFFFFFFFEF9F0 ? 000002004 ?
1106E2F80 ? 10A3B0164 ?
000000000 ? 1106E2F80 ?
ksedst()+40 call ksedst1() 30300000000 ? 002050033 ?
10A3B0158 ? 700000000025C ?
000000000 ? 000000000 ?
10A3AF7B8 ? 000000000 ?
dbkedDefDump()+1516 call ksedst() 000000000 ? 000000000 ?
FFFFFFFFFFEFBE0 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 300000003 ?
ksedmp()+72 call dbkedDefDump() 300000000 ? 110000AF8 ?
FFFFFFFFFFF01D0 ?
4844424409A016E4 ?
10011C824 ? 109A016D8 ?
FFFFFFFFFFF0220 ? 11065B098 ?
ksfdmp()+100 call ksedmp() 000000000 ? 000000000 ?
009A016F3 ? 10A9442A8 ?
200000000000000 ? 000000000 ?
11096D7B8 ? 1106E2F80 ?
dbgexPhaseII()+1940 call ksfdmp() 000002004 ? 1106E2F80 ?
000000000 ? FFFFFFFFFFF0398 ?
FFFFFFFFFFF02C0 ? 000000000 ?
10024B578 ? 110000B30 ?
dbgexProcessError() call dbgexPhaseII() 1106E2F80 ? 11096B9C8 ?
+932 000081232 ? 200000000 ?
FFFFFFFFFFF0FA8 ? 000000082 ?
900000008DCC538 ? 000000000 ?
dbgeExecuteForError call dbgexProcessError() 1106E2F80 ? 11096D7B8 ?
()+72 1FFFF43D8 ? 01065B098 ?
700000B8BC45108 ? 000000000 ?
000000EA0 ? 11096F500 ?
dbgePostErrorKGE()+ call dbgeExecuteForError FFFFFFFFFFF4410 ? 000000000 ?
2048 () 10212A990 ? 700000B9F93DBD0 ?
FFFFFFFFFFF4470 ? 000000004 ?
111368458 ? 000000000 ?
dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 110000CE0 ? 1099C7CA0 ?
68 25800000000 ? 000000000 ?
000000258 ? 11097DD38 ?
000000000 ? 000003D5F ?
kgeade()+364 call dbkePostKGE_kgsf() 100402458 ? 000000000 ?
700000B9CF4DBD8 ? 110B73688 ?
10A1335E0 ? 10A1235E0 ?
110975E22 ? 000000001 ?
kgeriv_int()+116 call kgeade() 000000016 ? 700000C38298638 ?
000000012 ? 000000013 ?
110B736B0 ? 110B73688 ?
100000000 ? 23490A1703423 ?
kgeriv()+36 call kgeriv_int() 002000008 ? 000000083 ?
FFFFFFFFFFF5280 ? 00000005A ?
0109D2290 ? 1109D4710 ?
80001109D2290 ? 1109D46C8 ?
kgeasi()+232 call kgeriv() 000000016 ? 002050000 ?
10A1335E0 ? 10A1235E0 ?
110B73688 ? 110975DD8 ?
000000000 ? 700000000044F20 ?
kxfpg1sg()+3228 call kgeasi() 700000B9CF4DBD8 ?
700000C38298638 ? E02050000 ?
000000001 ? 300000003 ?
000000000 ? 000000004 ?
000000002 ?
kxfpgsg()+7544 call kxfpg1sg() 110975D38 ? 100000001 ?
1004022E8 ? 110C6A252 ?
10A1235E0 ? 00000FFFF ?
000000EA0 ? 000000009 ?
kxfrAllocSlaves()+5 call kxfpgsg() 4FFFF5FC8 ? 100000000000000 ?
08 10A1BD910 ? 1FFFF5FE8 ?
1FFFF5FE0 ? 11099F840 ?
11099F848 ? 28300011099FD38 ?
kxfrialo()+2088 call kxfrAllocSlaves() 000000000 ? 000000000 ?
FFFFFFFFFFF61F0 ? 000000000 ?
10389057C ? 10AC9C558 ?
10AC9C558 ? 000000001 ?
kxfralo()+424 call kxfrialo() 000000000 ? 100000000000001 ?
1000000000001 ?
700000B74381280 ? 1110CA810 ?
000000000 ? 700000B74381240 ?
000000000 ?
qerpx_rowsrc_start( call kxfralo() 0FFFF6A80 ? 100000020024282 ?
)+1336 100010212A990 ?
700000B74381280 ? 1110CA810 ?
110000B30 ? 700000B74381240 ?
000000000 ?
qerpxStart()+552 call qerpx_rowsrc_start( 000000000 ? 000000000 ?
) 100000028 ? 000000000 ?
000000000 ? 000000048 ?
000000000 ? 000000002 ?
selexe0()+780 call qerpxStart() 000000001 ? 000000040 ?
8000000000000000 ?
110000B30 ? 1103713D8 ?
004080001 ? 022480021 ?
000000001 ?
opiexe()+14728 call selexe0() 7000002DB9E4610 ? B11366508 ?
000000000 ? FFFFFFFFFFF77B8 ?
000000000 ? FFFFFFFFFFF79B8 ?
FFFFFFFFFFF78F0 ? 000000000 ?
kpoal8()+4616 call opiexe() FFFFFFFFFFF8BA0 ? 11065B098 ?
FFFFFFFFFFF8DC0 ? 100000001 ?
000000000 ? 11065B098 ?
000000000 ? 000000000 ?
opiodr()+720 call kpoal8() 1FFFF9C40 ?
3554434A00001080 ?
101DDF038 ? 000000000 ?
FFFFFFFFFFF92E0 ? 110004EC0 ?
1106E2F80 ? 000000018 ?
ttcpip()+1028 call opiodr() 5EFD61C1B1 ? 1CA1B87408 ?
FFFFFFFFFFFAD80 ? E00000000 ?
000000000 ? 700000C82CFDFC0 ?
FFFFFFFFFFFA7A0 ? 000000000 ?
opitsk()+1508 call ttcpip() 11010C8A0 ? 11065B098 ?
1000BA100 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 110004B50 ?
opiino()+940 call opitsk() 110043CF0 ? 000000000 ?
110746E70 ? 11074A830 ?
1106E2F80 ? FFFFFFFFFFFCE60 ?
FFFFFFFFFFFEEBC ? 000000101 ?
opiodr()+720 call opiino() 3C009A783C ?
BFF0000000000000 ?
FFFFFFFFFFFF2E0 ?
FFFFFFFFFFFD975 ?
FFFFFFFFFFFD9B0 ? 11065B098 ?
FFFFFFFFFFFD9D0 ?
9FFFFFFF000D4D0 ?
opidrv()+1132 call opiodr() 3C0A871588 ? 41010B8A8 ?
FFFFFFFFFFFF2E0 ? 07530312F ?
107860E04 ? 11065B098 ?
732F79786A632F59 ?
11065B098 ?
sou2o()+136 call opidrv() 3C0786ABF0 ? 400000000 ?
FFFFFFFFFFFF2E0 ?
110003001D0000 ? 000150908 ?
000000000 ? 9001000A006F230 ?
11065B098 ?
opimai_real()+608 call sou2o() FFFFFFFFFFFF350 ? 000000001 ?
9000000000D55EC ?
BADC0FFEE0DDF00D ?
000000002 ? 9001000A007F990 ?
A0000000A000000 ? 10B136F60 ?
ssthrdmain()+268 call opimai_real() 9001000A0011C70 ?
FFFFFFFFFFFF6A8 ?
FFFFFFFFFFFF450 ?
15828C9BE8ED50DC ?
9000000000850D0 ?
9001000A007F990 ?
A0000000A000000 ?
9001000A007F990 ?
main()+204 call ssthrdmain() 2F0003640 ? FFFFFFFFFFFF798 ?
FFFFFFFFFFFF800 ?
9FFFFFFF000D4E8 ?
9FFFFFFF00009A0 ? 000000000 ?
000000000 ? 9FFFFFFF000D4E8 ?
__start()+112 call main() 000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
进一步的来分析:
2019-09-02 21:22:21.837413*:PX_Messaging:kxfp.c@10829:kxfpg1sg():
q=0x700000b9cf4dbd8 req_threads=1 nthreads=1 unit=1 #inst=1 gv$
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ99 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ98 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ97 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ96 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ95 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ94 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ93 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ92 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ91 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ90 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ89 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ88 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ87 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ86 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ85 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ84 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ83 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ82 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ81 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ80 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ79 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ78 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ77 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ76 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ75 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ74 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ73 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ72 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11439:kxfpg1srv(): trying to get slave PZ71 on instance 2 for q=0x700000b9cf4dbd8
PX_Messaging:kxfp.c@11481:kxfpg1srv(): slave PZ71 is remote (inst=2)
PX_Messaging:kxfp.c@11517:kxfpg1srv(): - acquired dp=0x0
PX_Messaging:kxfp.c@10952:kxfpg1sg(): Got It. 1 so far.
PX_Messaging:kxfp.c@11211:kxfpg1sg(): need to allocate qref to slaves
PX_Messaging:kxfp.c@11225:kxfpg1sg(): qref=0x700000c38298638 qref.sid=3571, qref.iid=2, ser=397825 qdp.sid=3571 qdp.iid=2
PX_Messaging:kxfp.c@11300:kxfpg1sg(): wait reply from qref 0x700000c38298638
-------------------------------------------------------------------------------
注意以上这段堆栈的执行过程,当实例1发起一个关于 GV$ 的并行查询,然后开始向实例2请求并行进程,一直请求不到,直达找到 PZ71,然后在执行并行查询之前,需要获得 qref Latch。
QREF 是 会话等待的进程队列的地址。在这里是:0x700000C38298638 。
这个地址也正是 ORA-00600 错误抛出的异常参数之一:
ORA-00600: internal error code, arguments: [15711], [4], [0x700000C38298638], [0x700000B9F93DBD0], [], [], [], [], [], [], [], []
在 Trace 文件中可以找到这个地址,其内容正是:Queue Reference ,这里可以看到,请求的信息在实例 2 ,进程号是 3571 :
----------------------------------------
SO: 0x700000c38298638, type: 95, owner: 0x700000b9cf4dbd8, flag: -/-/-/0x00 if: 0x1 c: 0x1
proc=0x700000c797815e0, name=process queue reference, file=kxfp.h LINE:3731 ID:, pg=0
Queue Reference--kxfpqr=0x700000c38298638 ser=397825 seq=1 error=0
opp qref=0x0 process=0x0 bufs:{0x700000bcfaa2bb0, 0x700000bdf7735c0}
state=01100 flags= nulls=0 hint:0x0 pseq=0 omh=0x700000bcfaa2bb0
latch 0x700000c38298798, remote descriptor:
Queue Descriptor--kxfpqd=0x700000c382986c0 remote queue=0x700000c67f8b228
instance id=2 server id=3571 numa#=0 flags= INIT
recovery info -- opr=0 bufs:{0x700000bdf7735c0, 0x700000bcfaa2bb0} state=10011 smh=0x700000bcfaa2bb0
现在的问题在于,对方迟迟没有给出回复:
PX_Messaging:kxfp.c@11225:kxfpg1sg(): qref=0x700000c38298638 qref.sid=3571, qref.iid=2, ser=397825 qdp.sid=3571 qdp.iid=2
PX_Messaging:kxfp.c@11300:kxfpg1sg(): wait reply from qref 0x700000c38298638
等待超时,然后本地进程就出现了错误。
综合来看,这个问题的原因就在于并行查询的队列处理出现超时,根源在于对方节点未能及时提供资源响应,要么是存在阻塞、锁定,是其根源。通过进一步分析对方节点的 ash 和 awr 应该能够找到更多信息。
此外,这个问题多数出现在 11.2.0.3 版本中,也建议升级到 11.2.0.4 再做观察。