问题描述
通过DBCA在服务器上创建一个额外的实例,结束后以外发现,原有的数据库实例出现了异常。
简单描述一下系统环境,Oracle 11.2.0.2 for Linux x86-64,随后又安装了GRID,创建了ASM磁盘组。原有的数据库有部分表空间和REDO存放在ASM磁盘组中。
为了测试,通过DBCA创建了额外的实例,事实上,通过DBCA又先后创建了两个数据库实例,第二个数据库实例创建后一切正常,将第二个实例关闭,通过DBCA创建第三个数据库实例时,导致服务器上的第一个实例出现了异常。
检查实例1上的告警日志,可以发现大量下面的错误信息:
2011-08-30 10:20:54.502000 +08:00 Process W000 died, see its trace file 2011-08-30 10:20:58.518000 +08:00 Process W000 died, see its trace file 2011-08-30 10:21:02.524000 +08:00 Process W000 died, see its trace file 2011-08-30 10:21:06.527000 +08:00 Process W000 died, see its trace file Errors IN file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19030.trc: ORA-27140: attach TO post/wait facility failed ORA-27300: OS system dependent operation:invalid_egid failed WITH STATUS: 1 ORA-27301: OS failure message: Operation NOT permitted ORA-27302: failure occurred at: skgpwinit6 ORA-27303: additional information: startup egid = 1000 (oinstall), CURRENT egid = 1200 (dba) 2011-08-30 10:21:07.536000 +08:00 Process J000 died, see its trace file kkjcre1p: unable TO spawn jobq slave process Errors IN file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_cjq0_12308.trc: Errors IN file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19036.trc: ORA-27140: attach TO post/wait facility failed ORA-27300: OS system dependent operation:invalid_egid failed WITH STATUS: 1 ORA-27301: OS failure message: Operation NOT permitted ORA-27302: failure occurred at: skgpwinit6 ORA-27303: additional information: startup egid = 1000 (oinstall), CURRENT egid = 1200 (dba)
专家解答
数据库并非简单的报错而已,实际上即使是SQLPLUS / AS SYSDBA登录数据库,进行任何操作也会显示没有连接到数据库,因此对于实例1来说,只能执行SHUTDOWN ABORT关闭实例并重新启动。
查询METALINK发现不少类似的情况,导致这个错误产生的原因就是ORA_CRS_HOME/bin下的oracle可执行文件或者ORACLE_HOME/bin下的oracle可执行文件发生修改所致。虽然从随后的ORA-2730X信息中无法判断是GRID用户下还是ORACLE用户下的oracle文件被修改,但是可以确定的是,oracle可执行文件的组信息被修改,原本应该是oinstall,而现在变成了dba。
分别检查ORA_CRS_HOME/bin和ORACLE_HOME/bin下的oracle文件,发现是ORACLE目录下的ORACLE_HOME/bin/oracle发生了改变:
[oracle@dbserver1 ~]$ cd $ORACLE_HOME/bin [oracle@dbserver1 bin]$ ls -l oracle -rwsr-s--x 1 oracle dba 228881982 Jul 26 16:55 oracle [oracle@dbserver1 bin]$ ls -l |grep dba -rwsr-s--x 1 oracle dba 228881982 Jul 26 16:55 oracle -rwxr-xr-x 1 oracle oinstall 32372 Jul 26 16:55 osdbagrp -rwxr-xr-x 1 oracle oinstall 35677 Aug 20 2010 osdbagrp0
可以看到,只有oracle文件变成了dba组,bin目录下所有其他文件仍然是oinstall组。为了对比,在另外一个服务器上,有一台同时搭建的相同环境的oracle,检查oracle文件的组信息:
[oracle@dbserver2 ~]$ cd $ORACLE_HOME/bin [oracle@dbserver2 bin]$ ls -l oracle -rwsr-s--x. 1 oracle oinstall 217733745 Apr 27 14:02 oracle [oracle@dbserver2 bin]$ ls -l |grep dba -rwxr-xr-x. 1 oracle oinstall 32372 Apr 27 14:02 osdbagrp -rwxr-xr-x. 1 oracle oinstall 35677 Aug 20 2010 osdbagrp0
在这个服务器上,oracle文件的组属性仍然是oinstall,这说明节点1上的oracle组属性确实被修改了。
前面提到了,利用DBCA添加第二个实例的时候,数据库并未出现问题,而添加第三个实例的时候,才导致oracle组信息变化,那么同样是dbca的操作,为什么有所区别呢。对比第二个和第三个实例的区别,问题应该和是否使用ASM有关。第二个实例的所有文件都存储在本地硬盘,而第三个实例的所有文件都保存在ASM磁盘组中,甚至会和实例1共用一些磁盘组,应该就是这个差别导致了问题的产生。
检查ASM告警日志:
2011-08-30 10:21:08.853000 +08:00 NOTE: client al32utf8:al32utf8 registered, osid 19095, mbr 0x1 2011-08-30 10:53:27.195000 +08:00 NOTE: client al32utf8:al32utf8 deregistered 2011-08-30 10:53:31.248000 +08:00 Starting background process ASMB ASMB started WITH pid=42, OS id=23251 NOTE: client +ASM:+ASM registered, osid 23253, mbr 0x1 2011-08-30 10:53:36.804000 +08:00 NOTE: ASMB process exiting due TO lack OF ASM file activity FOR 5 seconds 2011-08-30 10:53:51.372000 +08:00 NOTE: client al32utf8:al32utf8 registered, osid 23441, mbr 0x1
可以看到,就是第三个实例注册到ASM的时刻,在实例1上出现了错误。应该就是这个时刻,oracle文件的组信息发生了变化。
检查一下实例3上的告警日志:
2011-08-30 10:20:53.722000 +08:00 Adjusting the DEFAULT VALUE OF parameter parallel_max_servers FROM 960 TO 135 due TO the VALUE OF parameter processes (150) Starting ORACLE instance (normal) 2011-08-30 10:20:55.149000 +08:00 LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Shared memory segment FOR instance monitoring created Picked latch-free SCN scheme 3 2011-08-30 10:20:56.823000 +08:00 USING LOG_ARCHIVE_DEST_1 parameter DEFAULT VALUE AS /u01/app/oracle/product/11.2.0/dbhome_1/dbs/arch Autotune OF undo retention IS turned ON. IMODE=BR ILAT =29 LICENSE_MAX_USERS = 0 SYS auditing IS disabled Starting up: Oracle DATABASE 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production WITH the Partitioning, OLAP, DATA Mining AND REAL Application Testing options. USING parameter settings IN client-side pfile /u01/app/oracle/admin/al32utf8/pfile/init.ora ON machine dbserver1 System parameters WITH non-DEFAULT VALUES: processes = 150 memory_target = 6432M db_block_size = 8192 compatible = "11.2.0.0.0" db_create_file_dest = "+DATA" undo_tablespace = "UNDOTBS1" remote_login_passwordfile= "EXCLUSIVE" db_domain = "" dispatchers = "(PROTOCOL=TCP) (SERVICE=al32utf8XDB)" local_listener = "LISTENER_AL32UTF8" audit_file_dest = "/u01/app/oracle/admin/al32utf8/adump" audit_trail = "DB" db_name = "al32utf8" open_cursors = 300 diagnostic_dest = "/u01/app/oracle" 2011-08-30 10:21:06.647000 +08:00 PMON started WITH pid=2, OS id=19032 PSP0 started WITH pid=3, OS id=19034 2011-08-30 10:21:07.721000 +08:00 VKTM started WITH pid=4, OS id=19038 at elevated priority VKTM running at (1)millisec PRECISION WITH DBRM quantum (100)ms GEN0 started WITH pid=5, OS id=19042 DIAG started WITH pid=6, OS id=19044 DBRM started WITH pid=7, OS id=19046 DIA0 started WITH pid=8, OS id=19048 MMAN started WITH pid=9, OS id=19050 DBW0 started WITH pid=10, OS id=19052 DBW1 started WITH pid=11, OS id=19054 DBW2 started WITH pid=12, OS id=19056 LGWR started WITH pid=13, OS id=19058 CKPT started WITH pid=14, OS id=19060 SMON started WITH pid=15, OS id=19062 RECO started WITH pid=16, OS id=19064 MMON started WITH pid=17, OS id=19066 MMNL started WITH pid=18, OS id=19068 starting up 1 dispatcher(s) FOR network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... starting up 1 shared server(s) ... ORACLE_BASE FROM environment = /u01/app/oracle CREATE DATABASE "al32utf8" MAXINSTANCES 8 MAXLOGHISTORY 1 MAXLOGFILES 16 MAXLOGMEMBERS 3 MAXDATAFILES 100 DATAFILE SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED EXTENT MANAGEMENT LOCAL SYSAUX DATAFILE SIZE 2048M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE SIZE 8192M AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE SIZE 8192M AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED CHARACTER SET AL32UTF8 NATIONAL CHARACTER SET AL16UTF16 LOGFILE GROUP 1 SIZE 1024M, GROUP 2 SIZE 1024M, GROUP 3 SIZE 1024M USER SYS IDENTIFIED BY *USER SYSTEM IDENTIFIED BY Starting background process ASMB ASMB started WITH pid=22, OS id=19092 2011-08-30 10:21:08.901000 +08:00 Starting background process RBAL RBAL started WITH pid=23, OS id=19105 NOTE: initiating MARK startup Starting background process MARK MARK started WITH pid=24, OS id=19108 NOTE: MARK has subscribed 2011-08-30 10:21:12.325000 +08:00 NOTE: Loaded library: System SUCCESS: diskgroup DATA was mounted DATABASE mounted IN Exclusive Mode Lost WRITE protection disabled ERROR: failed TO establish dependency BETWEEN DATABASE al32utf8 AND diskgroup resource ora.DATA.dg 2011-08-30 10:21:36.862000 +08:00 Successful mount OF redo thread 1, WITH mount id 3103441300 Assigning activation ID 3103441300 (0xb8fac194) Thread 1 opened at log SEQUENCE 1 CURRENT log# 1 seq# 1 mem# 0: +DATA/al32utf8/onlinelog/group_1.265.760530073 Successful OPEN OF redo thread 1 MTTR advisory IS disabled because FAST_START_MTTR_TARGET IS NOT SET SMON: enabling cache recovery processing ?/rdbms/admin/dcore.bsq CREATE tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED EXTENT MANAGEMENT LOCAL online 2011-08-30 10:21:44.431000 +08:00 Completed: CREATE tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
可以确认实例1出现问题的时刻,正是实例3启动并在ASM磁盘组中创建表空间的时刻。
了解了问题的原委后,这个问题并不难解决,只需要重启实例1,就可以解决了。但是这个问题对于包含ASM的产品环境而言,存在很大的隐患,使用DBCA建立新库,会影响现有正在运行的数据库,这是一个埋藏比较深,且有可能造成严重后果的bug。
这个错误应该只会影响11.2以上版本,11.1以前的ASM在ORACLE_HOME目录下,应该不会导致这个问题的产生。