在 Oracle 数据库的世界里,通过10046事件跟踪解决未知问题,是 DBA 的重要技能之一。
掌握了Oracle数据库最为重要的跟踪方法,就可以在遇到问题时,快速定位根源。而找到问题根源,距离解决问题也就不远了——不论这些问题是已知的还是未知的。
以下一个案例来自于Oracle Database 12.2的版本,在数据库启动时遇到错误,数据库无法启动,抛出的异常是ORA-00600 908错误,关于这个错误在MOS上也没有说明(在撰文时未有任何已知资料),这属于最早发现的12.2的问题。
[oracle@enmocoredb ~]$ SQLplus / as sysdba
SQL*Plus: Release 12.2.0.0.3 Production on Mon Aug 1 09:59:54 2016
Copyright (c) 1982, 2016, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Database mounted.
ORA-00603: ORACLE server session terminated by fatal error
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00604: error occurred at recursive SQL level 1
ORA-00600: internal error code, arguments: [908], [], [], [], [], [], [], [], [], [], [], []
ORA-07445: exception encountered: core dump [ksupdbsesinc()+866] [SIGSEGV]
[ADDR:0x13650] [PC:0x6170C62] [Address not mapped to object] []
Process ID: 24446
Session ID: 30 Serial number: 42586
进一步的检查告警日志文件,获取后台记录的更详细信息,在这个案例中后台的异常日志和前台抛出的一致,包括跟踪日志,很难获得更明确的判断线索。
2016-08-01T10:02:56.913024+08:00
Errors in file /u01/app/oracle/diag/rdbms/eygle/eygle/trace/eygle_ora_24446.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00600: internal error code, arguments: [908], [], [], [], [], [], [], [], [], [], [], []
ORA-07445: exception encountered: core dump [ksupdbsesinc()+866] [SIGSEGV] [ADDR:0x13650] [PC:0x6170C62] [Address not mapped to object] []
Error 604 happened during db open, shutting down database
为了进一步分析这个问题,我们在Open数据库的阶段启用跟踪,以寻找最后出现问题的步骤,这一方法在实践中非常有效。
SQL> startup mount;
ORACLE instance started.
Database mounted.
SQL> alter session set events '10046 trace name context forever,level 12';
Session altered.
SQL> alter database open;
ORA-00600: internal error code, arguments: [908], [], [], [], [], [], [], [],[], [], [], []
现在可以在后台找到这个跟踪文件,通过定位最后出现问题的部分,获得线索。以下一段输出是数据库启动报错前最后执行的一段递归SQL。
=====================
PARSING IN CURSOR #0x7f19251432d8 len=123 dep=1 uid=0 oct=3 lid=0 tim=2423484984490 hv=1601912009 ad='0x6181e130' SQLid='65m6cgpgrqg69'
select /*+ NO_PARALLEL(c) */ c.con_id# from cdb_service$ c where lower(c.name) = lower(:1) or lower(c.name) = lower(:2)
END OF STMT
PARSE #0x7f19251432d8:c=0,e=293,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=2423484984490
BINDS #0x7f19251432d8:
Bind#0
oacdty=01 mxl=32(04) mxlc=00 mal=00 scl=00 pre=00
oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=7f19266bdf70 bln=32 avl=04 flg=05
value="enmo"
Bind#1
oacdty=01 mxl=32(04) mxlc=00 mal=00 scl=00 pre=00
oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=7f19266bdf38 bln=32 avl=04 flg=05
value="enmo"
EXEC #0x7f19251432d8:c=1000,e=594,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=2885951592,tim=2423484985147
FETCH #0x7f19251432d8:c=0,e=24,p=0,cr=3,cu=0,mis=0,r=1,dep=1,og=4,plh=2885951592,tim=2423484985192
STAT #0x7f19251432d8 id=1 cnt=1 pid=0 pos=1 obj=434 op='TABLE ACCESS FULL CDB_SERVICE$ (cr=3 pr=0 pw=0 str=1 time=23 us cost=2 size=16 card=1)'
CLOSE #0x7f19251432d8:c=0,e=41,dep=1,type=0,tim=2423484985251
kswscrs: error service is already defined in pdb 3.
: current pdb id=1
注意最后出现的提示:
kswscrs: error service is already defined in pdb 3,current pdb id=1。
这个提示给了我们一个重要线索,错误出现的原因是服务名已经在PDB 3中被定义和使用,因此出现了冲突。
再来看看最后执行的这个递归SQL,该SQL从cdb_service$表来取得服务名,进行验证。
select /*+ NO_PARALLEL(c) */ c.con_id# from cdb_service$ c where lower(c.name) = lower(:1) or lower(c.name) = lower(:2)
两个绑定变量的输入参数是:value="enmo" 。
找到了这样一个方向,进一步的检查数据库参数,发现在初始化参数中的确设置了一个服务名enmo,根据错误应该是这个服务名和PDB自动注册产生了冲突。
SQL> startup mount;
ORACLE instance started.
Database mounted.
SQL> show parameter service
NAME TYPE VALUE
------------------------------------ ----------- ------------------------
service_names string yhem,eygle,enmo
接下来尝试去掉这个服务名,重新启动数据库,数据库成功启动。
SQL> alter system set service_names='yhem,eygle';
System altered.
SQL> shutdown immediate;
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Database mounted.
Database opened.
SQL> show parameter service
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
service_names string yhem,eygle
在熟悉了Oracle的跟踪方法之后,就可以据此不断深入理解Oracle数据库的工作原理,在实践中应对各种异常,并直指根源,找到解决方案。
在12c的多租户环境,修改service_names参数应该非常谨慎,对于该参数的修改会直接反映到监听器的动态注册,甚至会覆盖PDB的动态注册,影响服务。以下过程展示这一知识点,在运维工作中尤其应当引起大家的注意和关注。
测试环境的CDB中存在两个PDB,分别是ENMO和YHEM2,同时service_names参数中存在两个服务名设定,分别是 yhem 和 eygle,这样数据库存在了四个服务名。
SQL> col name for a30
SQL> select con_id,dbid,name from v$pdbs;
CON_ID DBID NAME
---------- ---------- ------------------------------
2 612507346 PDB$SEED
3 965292808 ENMO
4 2839503056 YHEM2
SQL> show parameter service_names
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
service_names string yhem,eygle
在监听器中的注册的服务名如下表征(去除了不必要内容)。
[oracle@enmocoredb ~]$ lsnrctl status
LSNRCTL for Linux: Version 12.2.0.0.3 - Production on 03-AUG-2016 11:08:46
Copyright (c) 1991, 2016, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Services Summary...
Service "enmo" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "eygle" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "yhem" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "yhem2" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
The command completed successfully
如果通过如下方式修改了参数,去除部分服务名。
SQL> alter system set service_names='yhem';
System altered.
此时的监听状态会随之改变,包括PDB自动注册在内的服务名被一起清除。
[oracle@enmocoredb ~]$ lsnrctl status
LSNRCTL for Linux: Version 12.2.0.0.3 - Production on 03-AUG-2016 11:10:33
Copyright (c) 1991, 2016, Oracle. All rights reserved.
Services Summary...
Service "eygle" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "yhem" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
The command completed successfully
如果遇到这种状况,这意味着所有通过服务名访问PDB的请求都会无法获取连接。解决这个问题的方法是,将PDB的服务名重新通过service_names参数进行显示的设置,就可以临时恢复这个问题。
SQL> alter system set service_names='yhem,yhem2,enmo';
SQL> !lsnrctl status
LSNRCTL for Linux: Version 12.2.0.0.3 - Production on 03-AUG-2016 15:19:49
Copyright (c) 1991, 2016, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
Services Summary...
Service "enmo" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "eygle" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "yhem" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
Service "yhem2" has 1 instance(s).
Instance "eygle", status READY, has 1 handler(s) for this service...
The command completed successfully
但是注意,在12.2的已知问题未作为Bug修复之前,如果带着这个service_names参数设置重启数据库就会遇到前文所说的问题。