
Cluster services abnormal after a power outage; fixed by deleting /var/tmp/.oracle

Original | liketoochao | 2024-04-28

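1. Fault description

After several power outages the cluster services became abnormal and node 2 could no longer run normally.

2. Fault analysis

step 1. Check the ohas service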

[root@bzhissrv2 ~]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[root@bzhissrv2 ~]# 

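step 2. Check the alert log. It shows that the cluster services fail to start and keeps reporting that mdnsd failed to start, so examine the mdnsd process log next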

================================================================================
2023-11-22 11:08:34.438: [ default][1019287296]mdnsd START pid=37580 
2023-11-22 11:08:34.440: [ COMMCRS][1010734848]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=bzhissrv2DBG_MDNSD))

2023-11-22 11:08:34.441: [  clsdmt][1012836096]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=bzhissrv2DBG_MDNSD))
2023-11-22 11:08:34.441: [  clsdmt][1012836096]Terminating process
2023-11-22 11:08:34.441: [    MDNS][1012836096] clsdm requested mdnsd exit
2023-11-22 11:08:34.441: [    MDNS][1012836096] mdnsd exit
2023-11-22 11:12:44.442: [ default][3684009728]
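
The "Permission denied" line in this log is what points at the IPC socket files. A minimal sketch for pulling the affected IPC keys out of such a log follows; the function name denied_ipc_keys is hypothetical, and the log path you pass to it depends on your Grid home layout (not shown in this article).

```shell
# Hypothetical helper: print the IPC KEY of every "Permission denied"
# listen failure found in the given log file, one key per line.
denied_ipc_keys() {
    grep 'Permission denied' "$1" \
        | sed -n 's/.*(KEY=\([^)]*\)).*/\1/p' \
        | sort -u
}

# Usage (the path is an assumption, adjust to your environment):
#   denied_ipc_keys /path/to/mdnsd.log
```

On Linux each key typically corresponds to a socket file under /var/tmp/.oracle, which is why the next step looks at that directory.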

step 3. Check the owner and permissions of /var/tmp/.oracle. On the failed node the directory is owned by oracle, as shown below

[grid@bzhissrv2 tmp]$ ls -la
总用量 12
drwxrwxrwt.  3 root   root     4096 7月   5 10:52 .
drwxr-xr-x. 23 root   root     4096 11月  4 2022 ..
drwxrwxrwt   2 oracle oinstall 4096 11月 22 10:49 .oracle
[grid@bzhissrv2 tmp]$ 

Healthy node (the directory is owned by root)

[root@bzhissrv1 tmp]# ls -la
总用量 20
drwxrwxrwt.  3 root root      4096 2月  13 2023 .
drwxr-xr-x. 23 root root      4096 11月  4 2022 ..
drwxrwxrwt   2 root oinstall 12288 11月 22 11:05 .oracle
[root@bzhissrv1 tmp]# 
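
The comparison between the two nodes can be scripted. The sketch below assumes GNU stat (as found on Linux) and uses a hypothetical function name, check_dir_owner. The expected owner "root" is taken from the healthy node in this article and may differ in other installations.

```shell
# Hypothetical helper: compare the owner of a directory with an expected
# user name. Prints OK or MISMATCH; returns 0, 1, or 2 (stat failed).
check_dir_owner() {
    dir=$1
    expected=$2
    actual=$(stat -c '%U' "$dir") || return 2   # GNU stat: %U = owner name
    if [ "$actual" = "$expected" ]; then
        echo "OK $dir owner=$actual"
    else
        echo "MISMATCH $dir owner=$actual expected=$expected"
        return 1
    fi
}

# Usage, run on each node:
#   check_dir_owner /var/tmp/.oracle root
```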

step 4. Delete /var/tmp/.oracle; the cluster services then begin to start

[cssd(38422)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log
2023-11-22 11:26:27.187: 
[cssd(38422)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log
2023-11-22 11:26:42.196: 
[cssd(38422)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log
2023-11-22 11:26:57.205: 
[cssd(38422)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log
2023-11-22 11:27:27.754: 
[cssd(38422)]CRS-1707:Lease acquisition for node bzhissrv2 number 2 completed
2023-11-22 11:27:29.043: 
[cssd(38422)]CRS-1605:CSSD voting file is online: ORCL:OCR2; details in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log.
2023-11-22 11:27:29.053: 
[cssd(38422)]CRS-1605:CSSD voting file is online: ORCL:OCR1; details in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log.
2023-11-22 11:27:29.067: 
[cssd(38422)]CRS-1605:CSSD voting file is online: ORCL:OCR0; details in /u01/app/11.2.0/grid/log/bzhissrv2/cssd/ocssd.log.

The log reports that the voting files (disk group) could not be discovered, so the ASM disks were rescanned manually

[root@bzhissrv2 .oracle]# oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=grid
ORACLEASM_GID=asmdba
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER=""
ORACLEASM_SCANEXCLUDE=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"
[root@bzhissrv2 .oracle]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "DATA0"
Instantiating disk "DATA3"
Instantiating disk "DATA1"
Instantiating disk "DATA2"
Instantiating disk "DATA4"
Instantiating disk "DATA5"
Instantiating disk "DATA6"
Instantiating disk "OCR0"
Instantiating disk "OCR2"
Instantiating disk "OCR1"
[root@bzhissrv2 .oracle]# 

step 5. Continue watching the log

2023-02-14 17:50:50.778: [ora.oradb.db][360920832]{2:35133:39296} [stop] InstAgent::stop pool pConnxn e000da30
2023-02-14 17:50:50.778: [ora.oradb.db][360920832]{2:35133:39296} [stop] InstConnection::connectInt: server not attached
2023-02-14 17:50:50.803: [ora.oradb.db][360920832]{2:35133:39296} [stop] ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 54327 (asmadmin), current egid = 54321 (oinstall)
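
The ORA-27303 line carries the key detail: the effective group ID at startup differs from the one now in use. A small sketch for extracting both values from agent log text is shown here; the function name egid_mismatch is hypothetical, and it only reformats what the log already says without diagnosing the cause.

```shell
# Hypothetical helper: read log text on stdin and summarize every
# ORA-27303 "startup egid / current egid" pair it contains.
egid_mismatch() {
    sed -n 's/.*startup egid = \([0-9]*\) (\([^)]*\)), current egid = \([0-9]*\) (\([^)]*\)).*/startup=\2(\1) current=\4(\3)/p'
}

# Usage (the log path is an assumption):
#   egid_mismatch < /path/to/agent.log
```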

At this point the cluster services are in an abnormal state, and has and crs can be neither stopped nor started normally.

step 6. Reboot the host

Notes for RAC or HAS environments:

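  1. On Linux, the network socket files are in /var/tmp/.oracle/. On other platforms the directory may be /tmp/.oracle/*, /tmp/.oracle or /usr/tmp/.oracle.
  2. If CRS or HAS is not running, deleting the Oracle temporary files (the network socket files) has no ill effect; they are recreated automatically when CRS restarts.
  3. If CRS or HAS is already up and running normally, deleting the Oracle temporary files does not affect the running database, but the database can no longer be shut down normally (shutdown abort works, but the database cannot be started afterwards).
  4. In case 3, CRS cannot be stopped (not even with the -f option); the only way out is to clean up the shared memory segments manually and kill the processes. Under HAS, killing the ocssd.bin process does not reboot the host, but in a RAC environment killing ocssd.bin does reboot the host.
  5. Once case 4 has been dealt with, simply restart CRS or HAS.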
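
Notes 2 and 3 suggest checking whether the clusterware stack is down before touching the socket files. The following is a dry-run sketch under stated assumptions: the function name socket_cleanup_plan is hypothetical, the daemon list (ohasd.bin, ocssd.bin, crsd.bin) is an assumed minimal set that may not cover every daemon in your release, and find is assumed to support -mindepth and -maxdepth (GNU findutils). It only lists files and never deletes anything.

```shell
# Hypothetical helper: refuse if a clusterware daemon is running,
# otherwise print what a cleanup of the socket directory would remove.
socket_cleanup_plan() {
    dir=${1:-/var/tmp/.oracle}
    # Assumed daemon names; extend the list for your environment.
    for p in ohasd.bin ocssd.bin crsd.bin; do
        if pgrep -x "$p" >/dev/null 2>&1; then
            echo "REFUSE: $p is running; stop CRS/HAS first"
            return 1
        fi
    done
    # Dry run only: list the entries, do not delete them.
    find "$dir" -mindepth 1 -maxdepth 1 -exec printf 'would remove: %s\n' {} \;
}

# Usage:
#   socket_cleanup_plan /var/tmp/.oracle
```

Keeping the helper as a dry run is deliberate: per note 3, removing these files while the stack is up leaves the database unable to shut down cleanly.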