1、背景描述
最近某个开发的数据库频繁异常宕机,系统开发反馈每天早上需要手动启动数据库才行。昨天早上跟开发要了数据库的IP和账号密码,首先登上去看了下alert日志,发现有如下报错
Tue Dec 14 02:14:09 2021
Errors in file /software/oracle/app/diag/rdbms/db-01/db-01/trace/gzwa_dbw2_20900.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Errors in file /software/oracle/app/diag/rdbms/db-01/db-01/trace/gzwa_dbw3_20902.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
以下是trace日志
error 27157 detected in background process
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
alert日志也能看到实例恢复后数据库正常OPEN
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
parallel recovery started with 3 processes
Started redo scan
Completed redo scan
read 494 KB redo, 39 data blocks need recovery
Started redo application at
Thread 1: logseq 16941, block 39204
Recovery of Online Redo Log: Thread 1 Group 5 Seq 1730 Reading mem 0
Mem# 0: /software/oracle/app/oradata/db-01/redo03.log
Completed redo application of 0.24MB
某度上说可能是定时任务执行的脚本导致的问题,使用oracle用户contable看了下定时任务
[oracle@db-01 ~]$ crontab -l
0 2 * * * sh /software//oracle/oracle/expdp_db_bak.sh > /dev/null 2>&1
这是一个自动备份并清理历史备份任务的脚本,手动执行了下脚本并没有发现问题,但是这个脚本用oracle用户执行的,当执行完后退出oracle用户,就会出问题,MOS解释如下
As a result, when the last oracle or grid user disconnects, the OS removes shared memory segments and semaphores for those users.
As Oracle ASM and Databases use shared memory segments for SGA, removing shared memory segments will crash the Oracle ASM and database instances. (MOS 文档 ID 2081410.1)
2、环境介绍
操作系统:CentOS Linux release 7.2.1511(Core)
数据库:Oracle 11.2.0.1.0单实例环境,未打补丁
3、解决方案
根据MOS文档(2081410.1)解释如下
The semaphores used by Oracle have been inadvertendly removed
The errors are signalling that something happened at the OS level with shared memory and/or semaphores. The semaphore sets could be removed manually, or they could be dying for some reason due to a hardware error.
Either when remounting the /dev/shm or You may want to check for any possibility of a user dba using the "ipcrm" command to kill the semaphores (accidentally) since the error ora-27301 (OS failure message: Identifier removed) suggests that. Also, it could have been a bad memory stick or something else at the OS level. Someone could also have removed the shared memory segments at the OS level for some specific reason, or by accident. Most likely something had removed the shared memory and semaphore sets in use by 'oracle'. This can only be done by a root-level user or 'oracle' itself who owns the resources. If someone logged in as root and removed all IPC resources, Oracle would crash when it lost the allocated shared memory/semaphores.
solution:
This could be due to some outside user or application removing the semaphores/shared memory.
To monitor the semaphore/shared memory state we can use the following methods:
Setup a cronjob to run every 5-10min and dump the output of 'ipcs' and 'ps - ef' to a file with a timestamp.
Rotate your logs every 4-7 days to build a history.
Then if the problem re-occurs, we can at least try to make sure 'ipcrm' wasn't the culprit and get some general information of the state of the IPC resources plus the processes running.
You can also consult with your sysadmin to check if there is any OS level auditing that can be turned on to audit the usage of commands like 'ipcrm' which can remove shared memory segments /semaphore sets.
Note: This issue can happen on different platform, but in case you encounter the issue in RHEL7.2, then please also check below RHEL7.2 specific information.
In RHEL7.2 operating system setting RemoveIPC=YES crashes the database.The default value for RemoveIPC in RHEL7.2 is YES.
Workaround :
1) Set RemoveIPC=no in /etc/systemd/logind.conf if it is not in that file
2) Reboot the server or restart systemd-logind as follows:
# systemctl daemon-reload
# systemctl restart systemd-logind
OR
Migrate to Oracle Linux 7.2 resolves the problem.
4、操作命令
4.1、在/etc/systemd/logind.conf修改RemoveIPC
RemoveIPC=no
4.2、重启服务
systemctl daemon-reload
systemctl restart systemd-logind




