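1. Environment
Operating system: Red Hat Enterprise Linux Server release 6.7 (Santiago)
Database: Oracle 11.2.0.4
Architecture: two-node Oracle RAC
2. Symptoms
After database sessions were killed (ALTER SYSTEM KILL SESSION), the database hung: the SCAN IP became unreachable, the clusterware on node 1 hung, the clusterware on node 2 stayed healthy, and the database alert log showed errors.
3. Troubleshooting
3.1 Check node 2
On node 2 the database and the listener both reported a normal status, but the alert log contained some errors: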
ORA-06512: at line 2
Wed May 08 12:02:15 2024
opiodr aborting process unknown ospid (29466) as a result of ORA-28
Wed May 08 12:02:54 2024
opiodr aborting process unknown ospid (1178) as a result of ORA-28
Wed May 08 12:03:20 2024
opiodr aborting process unknown ospid (11861) as a result of ORA-28
Wed May 08 12:05:08 2024
Errors in file /oracle/app/oracle/diag/rdbms/erpdb/erpdb2/trace/erpdb2_ora_18783.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01031: insufficient privileges
ORA-06512: at line 2
Wed May 08 12:05:15 2024
Errors in file /oracle/app/oracle/diag/rdbms/erpdb/erpdb2/trace/erpdb2_ora_18796.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01031: insufficient privileges
ORA-06512: at line 2
Wed May 08 12:06:05 2024
opiodr aborting process unknown ospid (16037) as a result of ORA-28
Wed May 08 12:06:36 2024
opiodr aborting process unknown ospid (14515) as a result of ORA-28
Wed May 08 12:09:38 2024
opiodr aborting process unknown ospid (12474) as a result of ORA-28
Wed May 08 12:54:39 2024
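To get a feel for how many sessions were aborted by the kill operation, the ORA-28 lines in the alert log can be counted. The following is a minimal sketch, not part of the original investigation; the function name `count_ora28` is hypothetical, and the alert log path in the usage note assumes the usual ADR layout next to the trace files shown above.

```shell
#!/bin/bash
# Count "aborting process ... as a result of ORA-28" lines in an alert log.
# Usage: count_ora28 /path/to/alert_<SID>.log
count_ora28() {
    # -c prints the number of matching lines. The trailing group stops the
    # pattern from also matching longer codes such as ORA-280.
    # "|| true" keeps a zero count from looking like a failure to callers.
    grep -cE 'as a result of ORA-28([^0-9]|$)' "$1" || true
}
```

Possible usage, assuming the alert log sits in the same trace directory: `count_ora28 /oracle/app/oracle/diag/rdbms/erpdb/erpdb2/trace/alert_erpdb2.log`.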
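3.2 Restart the database on node 2
The database on node 2 was restarted, but applications still could not connect, either through the SCAN IP or through the VIPs.
Checking the SCAN IP showed that it had not failed over to node 2; stopping the cluster services on that node did not help either.
3.3 Check the SCAN IP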
rac01
[root@rac01 bin]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:49:7e brd ff:ff:ff:ff:ff:ff
inet 10.101.8.71/24 brd 10.101.8.255 scope global eth0
inet 10.101.8.73/24 brd 10.101.8.255 scope global secondary eth0:1
inet 10.101.8.70/24 brd 10.101.8.255 scope global secondary eth0:2
inet6 fe80::250:56ff:fe9c:497e/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:4e:4a brd ff:ff:ff:ff:ff:ff
inet 192.168.2.71/24 brd 192.168.2.255 scope global eth1
inet 169.254.150.24/16 brd 169.254.255.255 scope global eth1:1
inet6 fe80::250:56ff:fe9c:4e4a/64 scope link
valid_lft forever preferred_lft forever
rac02
[root@rac02 bin]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:5e:65 brd ff:ff:ff:ff:ff:ff
inet 10.101.8.72/24 brd 10.101.8.255 scope global eth0
inet 10.101.8.74/24 brd 10.101.8.255 scope global secondary eth0:1
inet6 fe80::250:56ff:fe9c:5e65/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:0c:d5 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.72/24 brd 192.168.2.255 scope global eth1
inet 169.254.155.91/16 brd 169.254.255.255 scope global eth1:1
inet6 fe80::250:56ff:fe9c:cd5/64 scope link
valid_lft forever preferred_lft forever
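Reading two full `ip a` listings by eye is error-prone, so the check "is this address bound on this node?" can be scripted. The sketch below is an illustration only: `has_ip` is a hypothetical helper, and 10.101.8.70 is used as the example simply because it is one of the secondary addresses bound to rac01 above; whether it is the SCAN address or a VIP is an assumption.

```shell
#!/bin/bash
# Succeed if the given IPv4 address appears in `ip -o -4 addr show` output
# read from standard input.
has_ip() {
    # With -o every address is printed on one line and the fourth field
    # has the form "address/prefix"; compare the address part exactly.
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 }
                    END { exit(found ? 0 : 1) }'
}
```

On each node this could be run as `ip -o -4 addr show | has_ip 10.101.8.70 && echo "bound here"`. Reading from standard input keeps the helper easy to try against saved output.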
3.4 Reboot the operating system on node 1
Node 2
[root@rac02 bin]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:5e:65 brd ff:ff:ff:ff:ff:ff
inet 10.101.8.72/24 brd 10.101.8.255 scope global eth0
inet 10.101.8.74/24 brd 10.101.8.255 scope global secondary eth0:1
inet 10.101.8.73/24 brd 10.101.8.255 scope global secondary eth0:2
inet 10.101.8.70/24 brd 10.101.8.255 scope global secondary eth0:3
inet6 fe80::250:56ff:fe9c:5e65/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:9c:0c:d5 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.72/24 brd 192.168.2.255 scope global eth1
inet 169.254.155.91/16 brd 169.254.255.255 scope global eth1:1
inet6 fe80::250:56ff:fe9c:cd5/64 scope link
valid_lft forever preferred_lft forever
[root@rac02 bin]#
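The cluster IPs had now failed over to node 2, and testing confirmed that the business applications worked normally again.
4. Node troubleshooting
4.1 Check the node 1 logs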
2024-05-08 13:18:16.972:
[crsd(12192)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/crs/rac01/A0615872
[crsd(12192)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/crs/rac01/A4934540
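4.2 Disk cleanup
The local disk did appear to be full, so the related log files were cleaned up.
4.3 Start the cluster services
The cluster services were started, but errors were still reported. The log showed: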
2024-05-08 13:30:40.682:
[cssd(11653)]CRS-1625:Node rac02, number 2, was manually shut down
2024-05-08 13:30:40.690:
[cssd(11653)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac01 .
2024-05-08 13:30:40.750:
[crsd(12192)]CRS-5504:Node down event reported for node 'rac02'.
2024-05-08 13:30:40.773:
[crsd(12192)]CRS-2773:Server 'rac02' has been removed from pool 'Generic'.
2024-05-08 13:30:40.789:
[crsd(12192)]CRS-2773:Server 'rac02' has been removed from pool 'ora.archives'.
2024-05-08 13:30:40.789:
[crsd(12192)]CRS-2773:Server 'rac02' has been removed from pool 'ora.erpdb'.
2024-05-08 13:30:40.789:
[crsd(12192)]CRS-2773:Server 'rac02' has been removed from pool 'ora.hrdb'.
2024-05-08 13:30:41.591:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A0579656
2024-05-08 13:30:41.622:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A2772085
2024-05-08 13:30:41.657:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A2575363
2024-05-08 13:30:41.704:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A0142633
2024-05-08 13:30:41.761:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A8520041
2024-05-08 13:30:41.803:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A6722337
2024-05-08 13:30:44.654:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A5993362
2024-05-08 13:30:45.162:
[/oracle/app/1120/grid/bin/oraagent.bin(12431)]CRS-5818:Aborted command 'check' for resource 'ora.archives.db'. Details at (:CRSAGF00113:) {1:34110:2} in /oracle/app/1120/grid/log/rac01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2024-05-08 13:30:51.184:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A7950983
2024-05-08 13:30:58.640:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /oracle/app/1120/grid/auth/evm/rac01/A0030995
2024-05-08 13:30:59.171:
[evmd(11781)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
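The log repeats the same CLSU-00104 failure many times, which makes it hard to see which directories are affected. A short sketch that condenses those lines into one row per parent directory is given here; `mkdir_failures` is a hypothetical name, and the log file location in the usage note is an assumption based on the usual Grid Infrastructure layout.

```shell
#!/bin/bash
# Summarise "failed to make dir" errors from a clusterware alert log read on
# standard input: prints "<parent directory> <failure count>" per directory.
mkdir_failures() {
    # Keep only the path after "failed to make dir" and drop its last
    # component (the randomly named directory that could not be created).
    sed -n 's|.*failed to make dir \(.*\)/[^/]*$|\1|p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example: `mkdir_failures < /oracle/app/1120/grid/log/rac01/alertrac01.log` (the file name is assumed). In the excerpts above this would point at the `auth/crs/rac01` and `auth/evm/rac01` directories under the grid home, both of which live on the root filesystem.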
4.4 Directory permissions
The permissions on /oracle/app/1120/grid/auth/evm were identical on node 1 and node 2.
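One way to make such a comparison repeatable is to print a compact "mode owner:group" signature for the directory on each node and compare the two strings. This is a sketch under the assumption that GNU `stat` is available (it is on RHEL 6); `perm_sig` is a hypothetical helper name.

```shell
#!/bin/bash
# Print "<octal mode> <owner>:<group>" for a file or directory.
perm_sig() {
    # %a = access rights in octal, %U = owner name, %G = group name.
    stat -c '%a %U:%G' "$1"
}
```

Running `perm_sig /oracle/app/1120/grid/auth/evm` on rac01 and rac02 and comparing the output would confirm that permissions are not the cause of the mkdir failures.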
4.4 inode usage
The root filesystem had run out of inodes, so no new files or directories could be created.
[root@rac01 bin]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda3 7774208 7774208 0 100% /
tmpfs 4126795 276 4126519 1% /dev/shm
/dev/sda1 51200 39 51161 1% /boot
/dev/sdb 9830400 902 9829498 1% /home/oracle/rman_backup
/dev/sdc1 1310720 2033 1308687 1% /nbubackup
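Error 28 ("No space left on device") is returned both when data blocks run out and when inodes run out, which is why freeing disk space in 4.2 did not help. A small filter over `df -i` output can flag filesystems that are close to inode exhaustion. This is an illustrative sketch; `full_inodes` and the default threshold of 90 percent are assumptions, not something from the original incident.

```shell
#!/bin/bash
# Print "<mount point> <IUse%>" for every filesystem whose inode usage is at
# or above the given percentage (default 90). Reads `df -i` output on stdin.
full_inodes() {
    awk -v limit="${1:-90}" 'NR > 1 && $5 ~ /%$/ {
        use = $5
        sub(/%/, "", use)              # "100%" -> "100"
        if (use + 0 >= limit + 0) print $NF, $5
    }'
}
```

A possible invocation is `df -P -i | full_inodes 95`; the `-P` option keeps each filesystem on a single line even when the device name is long. Against the listing above, only `/` would be reported.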
4.5 Free up inodes
The directory below held more than 230,000 files; the oldest ones were removed.
[root@rac01 audit]# pwd
/oracle/app/1120/grid/rdbms/audit
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5427_20160805155359989656143795.aud
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5431_20160805155400249241143795.aud
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5433_20160805155400325413143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_5443_20160805155401126727143795.aud
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5479_20160805155401973889143795.aud
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5481_20160805155402051248143795.aud
-rw-r----- 1 grid oinstall 779 Aug 5 2016 +ASM1_ora_5483_20160805155402218923143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_5838_20160805155413597912143795.aud
-rw-r----- 1 grid oinstall 964 Aug 5 2016 +ASM1_ora_5375_20160805155353787727143795.aud
-rw-r----- 1 grid oinstall 748 Aug 5 2016 +ASM1_ora_6469_20160805155653467804143795.aud
-rw-r----- 1 grid oinstall 773 Aug 5 2016 +ASM1_ora_6469_20160805155657452824143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_6522_20160805155657491816143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_6557_20160805155705795581143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_6561_20160805155705878798143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_6565_20160805155705947289143795.aud
-rw-r----- 1 grid oinstall 774 Aug 5 2016 +ASM1_ora_7069_20160805155748988695143795.aud
-rw-r----- 1 grid oinstall 776 Aug 5 2016 +ASM1_ora_10165_20160805160831004935143795.aud
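With this many files, `rm *.aud` is likely to fail with "Argument list too long", so `find` is a safer tool for the cleanup. The sketch below shows one way to count the files and delete old audit files; the function names and the retention period in the usage note are assumptions, and the original article does not say which command was actually used.

```shell
#!/bin/bash
# Count regular files directly inside a directory (no recursion).
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Delete *.aud files directly inside a directory that were last modified
# more than the given number of days ago. Other files are left untouched.
purge_old_aud() {
    local dir=$1 days=$2
    find "$dir" -maxdepth 1 -type f -name '*.aud' -mtime +"$days" -delete
}
```

For instance, `count_files /oracle/app/1120/grid/rdbms/audit` followed by `purge_old_aud /oracle/app/1120/grid/rdbms/audit 30` would keep roughly the last month of audit files (the 30-day value is only an example; audit retention requirements should be checked first). Scheduling the same command from cron would help keep the inode count from growing unbounded again.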
4.6 Start the cluster
On the next start attempt, the cluster and the database came up.
Name Type Target State Host
------------------------------------------------------------
ora.EASARCH.dg ora....up.type ONLINE ONLINE rac01
ora.EASDATA.dg ora....up.type ONLINE ONLINE rac01
ora.ERPARCH.dg ora....up.type ONLINE ONLINE rac01
ora.ERPDATA.dg ora....up.type ONLINE ONLINE rac01
ora.HRARCH.dg ora....up.type ONLINE ONLINE rac01
ora.HRDATA.dg ora....up.type ONLINE ONLINE rac01
ora....ER.lsnr ora....er.type ONLINE ONLINE rac01
ora....N1.lsnr ora....er.type ONLINE ONLINE rac02
ora.OCRNEW.dg ora....up.type ONLINE ONLINE rac01
ora....ives.db ora....se.type ONLINE OFFLINE
ora.asm ora.asm.type ONLINE ONLINE rac01
ora.cvu ora.cvu.type ONLINE ONLINE rac02
ora.erpdb.db ora....se.type ONLINE ONLINE rac01
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora.hrdb.db ora....se.type ONLINE OFFLINE
ora....network ora....rk.type ONLINE ONLINE rac01
ora.oc4j ora.oc4j.type ONLINE ONLINE rac02
ora.ons ora.ons.type ONLINE ONLINE rac01
ora....SM1.asm application ONLINE ONLINE rac01
ora....01.lsnr application ONLINE ONLINE rac01
ora.rac01.gsd application OFFLINE OFFLINE
ora.rac01.ons application ONLINE ONLINE rac01
ora.rac01.vip ora....t1.type ONLINE ONLINE rac01
ora....SM2.asm application ONLINE ONLINE rac02
ora....02.lsnr application ONLINE ONLINE rac02
ora.rac02.gsd application OFFLINE OFFLINE
ora.rac02.ons application ONLINE ONLINE rac02
ora.rac02.vip ora....t1.type ONLINE ONLINE rac02
ora....ry.acfs ora....fs.type ONLINE ONLINE rac01
ora.scan1.vip ora....ip.type ONLINE ONLINE rac02
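In a listing of this length it is easy to miss resources that are supposed to be running but are not. The following sketch filters the table for rows whose Target is ONLINE while the State is OFFLINE; `offline_but_wanted` is a hypothetical name, and it assumes the column layout shown above (Name, Type, Target, State, Host).

```shell
#!/bin/bash
# Print the names of resources whose Target is ONLINE but whose State is
# OFFLINE. Reads a status table in the layout above from standard input.
offline_but_wanted() {
    # Columns: 1 = Name, 2 = Type, 3 = Target, 4 = State, 5 = Host (optional).
    awk '$3 == "ONLINE" && $4 == "OFFLINE" { print $1 }'
}
```

Assuming the table was produced by `crs_stat -t`, it could be piped straight in: `crs_stat -t | offline_but_wanted`. Applied to the listing above, it would report `ora....ives.db` and `ora.hrdb.db`, which may deserve a follow-up check even though the cluster itself is back.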
Last modified: 2024-05-09 10:07:07