
环境说明:
DB:Oracle 11.2.0.4.0 RAC 两节点OS:Redhat 7.6
故障说明:
突然收到如下告警
【XXXX】[2021-xx-xx xx:xx:xx]-节点[CJC系统数据库节点2],连接失败!java.sql.SQLRecoverableException: IO Error: Got minus one from a read call[XXX][XX]
问题分析:
根据Got minus one from a read call提示,按照以往经验,初步怀疑数据库连接数满了。
1 首先要确定影响范围
检查数据库是否可以连接。ssh登录到本地,执行sqlplus。sqlplus as sysdba数据库连接登录,但是无法执行任何SQL语句,自动断开连接,看现象又不像连接数满的问题。目前影响范围:RAC节点1可以正常使用,节点2无法对外提供服务,如果有应用单指节点2,就会对业务有影响了。
2 检查数据库告警日志
发现数据库告警日志最近几天一直没有更新过,即数据库异常后,告警日志没有任何输出信息。
3 检查RAC集群日志
tail -10000 $GRID_HOME/log/`hostname`/alert*.log|more
发现故障期间有如下错误:
2021-04-16 18:31:01.023:[/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"2021-04-16 18:32:01.021:[/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"
根据提示,继续查看oraagent_grid.log日志
vi oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log
发现了如下日志:
2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/oracle/crs/opmn/bin/onsctli]2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] clsnUtils::error Exception type=2 string=
可疑点如下:
no exe permission, file [/oracle/crs/opmn/bin/onsctli]
查看onsctli文件权限
[root@cjcdb02 ~]# ls -l oracle/crs/opmn/bin/onsctli-rwx------ 1 oracle oinstall 13760 Jan 9 2020 oracle/crs/opmn/bin/onsctli
在查看正常节点文件权限
[root@cjcdb01 db_check]# ls -l oracle/crs/opmn/bin/onsctli-rwx------ 1 grid oinstall 13760 Jan 9 2020 oracle/crs/opmn/bin/onsctli
可以看到,文件所有者由grid变成了oracle。
权限为什么突然变了呢?除了这个文件以外,还有其他文件权限也变了吗?
继续查看权限,发现ORACLE_HOME,GRID_HOME下所有文件都变了,权限都变成了oracle:oinstall
4 问题原因
为什么文件权限变了,由于不是我负责的数据库,具体原因不清楚,估计是厂商在调试系统时误操作了,更改权限时指定了错误的目录并且加-R递归修改了目录下所有文件(当前系统没有正式上线)。
5 解决方案
当前RAC第二节点ORACLE_HOME和GRID_HOME权限丢失,由于之前没有对文件权限做过备份,只能通过正常的节点1的权限对比来修复节点2文件权限。
但是文件数有80多万,手动比对和修复显然不现实,其实ORACLE已经提供了工具,可以批量生成指定目录下所有文件权限信息。
可以参考MOS:
Script to capture and restore file permission in a directory (for eg. ORACLE_HOME) (Doc ID1515018.1)
具体操作如下:
(1) 下载permission.pl脚本(2) 上传到节点1(正常节点)(3) 修改脚本权限chmod 755 permission.pl(4) 捕获某目录下所有文件权限信息./permission.pl <Path name to capture permission>(5) 将执行脚本生成的文件上传到节点2(权限异常节点)有如下两个脚本:a. permission-<time stamp> - This contains file permission in octal value, owner and group information of the files capturedb. restore-perm-<time stamp>.cmd - This contains command to change the permission, owner, and group of the captured files(6) 根据restore-perm-<time stamp>.cmd脚本内容,修改目录位置,将节点1目录修改为节点2目录。例如::%s/cjcdb01/cjcdb02/g:%s/ASM1/ASM2/g(7) 节点2执行修复后的restore-perm-<time stamp>.cmd脚本(8) 重启节点2crs及实例。
6 问题重现测试
(1)备份文件
[root@cjcdb02 ~]# cd oracle/crs/bin/[root@cjcdb02 bin]# ./crsctl stop crs[root@cjcdb02 ~]# tar -zcvf oracle_20210416bak.tar.gz oracle
(2)修改权限,模拟误操作
[root@cjcdb02 ~]# chown oracle.oinstall /oracle -R
(3)启动数据库
如果是在数据库启动情况下,修改的权限,数据库无法正常使用,无法通过shutdown immediate进行关闭,可以使用shutdown abort方式关闭。
[oracle@cjcdb02 ~]$ sqlplus / as sysdbaSQL*Plus: Release 11.2.0.4.0 Production on Fri Apr 16 18:31:06 2021Copyright (c) 1982, 2013, Oracle. All rights reserved.Connected to an idle instance.SQL> startupORA-01078: failure in processing system parametersORA-01565: error in identifying file '+DATA/kxx/spfilekxx.ora'ORA-17503: ksfdopn:10 Failed to open file +DATA/kxx/spfilekxx.oraORA-01034: ORACLE not availableORA-27123: unable to attach to shared memory segmentLinux-x86_64 Error: 13: Permission deniedAdditional information: 26Additional information: 1310725
报错:Permission denied
也无法关闭crs
[root@cjcdb02 bin]# ./crsctl stop crsCRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'cjcdb02'CRS-2673: Attempting to stop 'ora.crsd' on 'cjcdb02'CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'cjcdb02'CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'cjcdb02'CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'cjcdb02'CRS-2673: Attempting to stop 'ora.OCR.dg' on 'cjcdb02'CRS-2673: Attempting to stop 'ora.DATA.dg' on 'cjcdb02'CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'cjcdb02' succeededCRS-2673: Attempting to stop 'ora.cjcdb02.vip' on 'cjcdb02'CRS-2677: Stop of 'ora.ARCH.dg' on 'cjcdb02' succeededCRS-2677: Stop of 'ora.DATA.dg' on 'cjcdb02' succeededCRS-2677: Stop of 'ora.cjcdb02.vip' on 'cjcdb02' succeededCRS-2672: Attempting to start 'ora.cjcdb02.vip' on 'cjcdb01'CRS-2676: Start of 'ora.cjcdb02.vip' on 'cjcdb01' succeededCRS-2677: Stop of 'ora.OCR.dg' on 'cjcdb02' succeededCRS-2673: Attempting to stop 'ora.asm' on 'cjcdb02'CRS-2677: Stop of 'ora.asm' on 'cjcdb02' succeededCRS-2673: Attempting to stop 'ora.ons' on 'cjcdb02'CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "stop": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-2675: Stop of 'ora.ons' on 'cjcdb02' failedCRS-2679: Attempting to clean 'ora.ons' on 'cjcdb02'CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "clean": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-2678: 'ora.ons' on 'cjcdb02' has experienced an unrecoverable failureCRS-0267: Human intervention required to resume its availability.CRS-2799: Failed to shut down resource 'ora.net1.network' on 'cjcdb02'CRS-2799: Failed to shut down resource 'ora.ons' on 'cjcdb02'CRS-2799: Failed to shut down resource 'ora.cjcdb02.ons' on 'cjcdb02'CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'cjcdb02' has failedCRS-5022: Stop of resource "ora.crsd" failed: current state is "UNKNOWN"CRS-2675: Stop of 'ora.crsd' on 'cjcdb02' failedCRS-2799: Failed to shut down resource 'ora.crsd' on 'cjcdb02'CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' has failedCRS-4687: Shutdown command has completed with errors.CRS-4000: Command Stop failed, or completed with errors.
[root@cjcdb02 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'cjcdb02'CRS-2673: Attempting to stop 'ora.crsd' on 'cjcdb02'CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'cjcdb02'CRS-2673: Attempting to stop 'ora.net1.network' on 'cjcdb02'CRS-2679: Attempting to clean 'ora.ons' on 'cjcdb02'CRS-2677: Stop of 'ora.net1.network' on 'cjcdb02' succeededCRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "clean": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"CRS-2680: Clean of 'ora.ons' on 'cjcdb02' failedCRS-2799: Failed to shut down resource 'ora.ons' on 'cjcdb02'CRS-2799: Failed to shut down resource 'ora.cjcdb02.ons' on 'cjcdb02'CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'cjcdb02' has failedCRS-5022: Stop of resource "ora.crsd" failed: current state is "UNKNOWN"CRS-2675: Stop of 'ora.crsd' on 'cjcdb02' failedCRS-2799: Failed to shut down resource 'ora.crsd' on 'cjcdb02'CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' has failedCRS-4687: Shutdown command has completed with errors.CRS-4000: Command Stop failed, or completed with errors.
[root@cjcdb02 bin]# ps -ef|grep d.bin
root 1041 1 0 18:25 ? 00:00:02 /oracle/crs/bin/ohasd.bin rebootroot 1140 1 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python -m upgraderoot 1370 1 0 Apr06 ? 00:00:49 /opt/ant-agent/agent/embedded/bin/python /opt/ant-agent/agent/bin/circled --config /opt/ant-agent/agent/proc --log-output /opt/ant-agent/agent/logs/circle.log --daemonroot 1375 1370 0 Apr06 ? 00:16:36 /opt/ant-agent/agent/embedded/bin/python -m frameworkroot 1380 1370 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./confroot 1381 1370 0 Apr06 ? 00:48:40 /opt/ant-agent/agent/embedded/bin/python ./agent/collect.py --use-local-forwardergrid 1384 1 0 18:25 ? 00:00:00 /oracle/crs/bin/mdnsd.binroot 1392 1370 0 Apr06 ? 00:12:53 /opt/ant-agent/agent/embedded/bin/python ./agent/forwarder.pygrid 1404 1 0 18:25 ? 00:00:00 /oracle/crs/bin/gpnpd.bingrid 1421 1 0 18:25 ? 00:00:01 /oracle/crs/bin/gipcd.binroot 1435 1 1 18:25 ? 00:00:06 /oracle/crs/bin/osysmond.bingrid 1476 1 0 18:25 ? 00:00:02 /oracle/crs/bin/ocssd.binroot 1512 1380 0 Apr06 ? 00:21:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./confroot 1514 1512 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./confroot 2191 1 0 18:26 ? 00:00:00 /oracle/crs/bin/octssd.bin rebootgrid 2342 1 0 18:26 ? 00:00:00 /oracle/crs/bin/evmd.binroot 2732 1 0 18:26 ? 00:00:02 /oracle/crs/bin/crsd.bin rebootroot 8108 30161 0 18:35 pts/0 00:00:00 grep --color=auto d.bin
(4)查看集群日志可以看到权限有问题
集群日志
2021-04-16 18:31:01.023:[/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"[root@cjcdb02 ~]# vi /oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/oracle/crs/opmn/bin/onsctli]2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] clsnUtils::error Exception type=2 string=[root@cjcdb02 ~]# ls -l /oracle/crs/opmn/bin/onsctli-rwx------ 1 oracle oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
(5)修复问题
节点1[root@cjcdb01 db_check]# pwd/db_check[root@cjcdb01 db_check]# ls -lrthtotal 4.0K-rw-r--r-- 1 root root 2.4K Apr 16 12:31 permission.pl[root@cjcdb01 db_check]# chmod 755 permission.pl[root@cjcdb01 db_check]# ./permission.pl /oracleFollowing log files are generatedlogfile : permission-Fri-Apr-16-12-34-00-2021Command file : restore-perm-Fri-Apr-16-12-34-00-2021.cmdLinecount : 438065[root@cjcdb01 db_check]# ll -rthtotal 113M-rwxr-xr-x 1 root root 2.4K Apr 16 12:31 permission.pl-rw-r--r-- 1 root root 75M Apr 16 12:36 restore-perm-Fri-Apr-16-12-34-00-2021.cmd-rw-r--r-- 1 root root 39M Apr 16 12:36 permission-Fri-Apr-16-12-34-00-2021
将生成的文件拷贝到节点2
修改部分内容后执行
[root@cjcdb02 db_check]# vi restore-perm-Fri-Apr-16-12-34-00-2021.cmd:%s/cjcdb01/cjcdb02/g[root@cjcdb02 db_check]# chmod 755 restore-perm-Fri-Apr-16-12-34-00-2021.cmd[root@cjcdb02 db_check]# ./restore-perm-Fri-Apr-16-12-34-00-2021.cmd
执行脚本修复权限过程中,查看文件权限逐渐恢复,告警日志
[root@cjcdb02 db_check]# tail -f /oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log2021-04-16 18:47:23.633: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:5422021-04-16 18:47:53.633: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:5462021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/oracle/crs2021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib oracle_home:/oracle/crs2021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils:execCmd action = 3 flags = 6 ohome = /oracle/crs/opmn/ cmdname = onsctli.2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)ons is running ...2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] execCmd ret = 02021-04-16 18:48:23.634: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:5502021-04-16 18:48:31.024: [CLSFRAME][3590973248] TM [MultiThread] is changing desired thread # to 2. Current # is 32021-04-16 18:48:31.024: [CLSFRAME][3472258816]{0:1:26} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 22021-04-16 18:48:53.634: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:5542021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/oracle/crs2021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib oracle_home:/oracle/crs2021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils:execCmd action = 3 flags = 6 ohome = /oracle/crs/opmn/ cmdname = onsctli.2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)ons is running ...2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] execCmd ret = 02021-04-16 18:49:01.127: [CLSFRAME][3590973248] TM [MultiThread] is changing desired thread # to 3. Current # is 22021-04-16 18:49:23.635: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:558[root@cjcdb02 ~]# ls -l /oracle/crs/opmn/bin/onsctli-rwx------ 1 grid oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
执行完脚本后,可以正常重启crs和db实例
[root@cjcdb02 bin]# ./crsctl start crsCRS-4123: Oracle High Availability Services has been started.
crs启动日志如下:
2021-04-16 18:52:36.699:[ohasd(3331)]CRS-2112:The OLR service started on node cjcdb02.2021-04-16 18:52:36.704:[ohasd(3331)]CRS-1301:Oracle High Availability Service started on node cjcdb02.2021-04-16 18:52:36.704:[ohasd(3331)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred2021-04-16 18:52:39.979:[/oracle/crs/bin/orarootagent.bin(3423)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).2021-04-16 18:52:44.379:[gpnpd(3566)]CRS-2328:GPNPD started on node cjcdb02.2021-04-16 18:52:46.727:[cssd(3640)]CRS-1713:CSSD daemon is started in clustered mode2021-04-16 18:52:48.562:[ohasd(3331)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE2021-04-16 18:53:10.861:[cssd(3640)]CRS-1707:Lease acquisition for node cjcdb02 number 2 completed2021-04-16 18:53:12.132:[cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr1; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log.2021-04-16 18:53:12.141:[cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr3; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log.2021-04-16 18:53:12.153:[cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr2; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log.2021-04-16 18:53:17.591:[cssd(3640)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cjcdb01 cjcdb02 .2021-04-16 18:53:19.987:[ctssd(4180)]CRS-2407:The new Cluster Time Synchronization Service reference node is host cjcdb01.2021-04-16 18:53:19.987:[ctssd(4180)]CRS-2401:The Cluster Time Synchronization Service started on host cjcdb02.2021-04-16 18:53:21.724:[ohasd(3331)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE2021-04-16 18:53:28.016:[ctssd(4180)]CRS-2403:The Cluster Time Synchronization Service on host cjcdb02 is in observer mode.2021-04-16 18:53:28.989:[ctssd(4180)]CRS-2409:The clock on host cjcdb02 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.2021-04-16 18:53:42.006:[crsd(4517)]CRS-1012:The OCR service started on node cjcdb02.2021-04-16 18:53:42.017:[evmd(4254)]CRS-1401:EVMD started on node cjcdb02.2021-04-16 18:53:43.061:[crsd(4517)]CRS-1201:CRSD started on node cjcdb02.
实例也可以正常启动
[root@cjcdb02 bin]# su - oracleLast login: Fri Apr 16 18:30:56 CST 2021 on pts/0[oracle@cjcdb02 ~]$ sqlplus / as sysdbaSQL*Plus: Release 11.2.0.4.0 Production on Fri Apr 16 18:54:22 2021Copyright (c) 1982, 2013, Oracle. All rights reserved.Connected to an idle instance.SQL> startupORACLE instance started.Total System Global Area 3724607488 bytesFixed Size 2258760 bytesVariable Size 771754168 bytesDatabase Buffers 2936012800 bytesRedo Buffers 14581760 bytesDatabase mounted.Database opened.
SQL> select count(*),inst_id from gv$session group by inst_id;COUNT(*) INST_ID---------- ----------42 142 2
注意:
虽然本次案例节点2可以正常恢复,但是在实际生产环境中,同一套RAC下不同节点的文件及文件权限可能千差万别,生产操作一定要有备份,能回退。
###2021-04-16 23:30 chenjuchao###





