暂无图片
暂无图片
18
暂无图片
暂无图片
暂无图片

ORA-27157 ORA-27300 ORA-27301 ORA-27302

1、背景描述

      最近某个开发的数据库频繁异常宕机,系统开发反馈每天早上需要手动启动数据库才行。昨天早上跟开发要了数据库的IP和账号密码,首先登上去看了下alert日志,发现有如下报错

Tue Dec 14 02:14:09 2021

Errors in file /software/oracle/app/diag/rdbms/db-01/db-01/trace/gzwa_dbw2_20900.trc:

ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Errors in file /software/oracle/app/diag/rdbms/db-01/db-01/trace/gzwa_dbw3_20902.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1

      以下是trace日志

error 27157 detected in background process
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1


      alert日志也能看到实例恢复后数据库正常OPEN

ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
parallel recovery started with 3 processes
Started redo scan
Completed redo scan
read 494 KB redo, 39 data blocks need recovery
Started redo application at
Thread 1: logseq 16941, block 39204
Recovery of Online Redo Log: Thread 1 Group 5 Seq 1730 Reading mem 0
Mem# 0: /software/oracle/app/oradata/db-01/redo03.log
Completed redo application of 0.24MB


     某度上说可能是定时任务执行的脚本导致的问题,使用oracle用户contable看了下定时任务

[oracle@db-01 ~]$ crontab -l
0 2 * * * sh  /software//oracle/oracle/expdp_db_bak.sh > /dev/null 2>&1

     这是一个自动备份并清理历史备份任务的脚本,手动执行了下脚本并没有发现问题,但是这个脚本用oracle用户执行的,当执行完后退出oracle用户,就会出问题,MOS解释如下

As a result, when the last oracle or grid user disconnects, the OS removes shared memory segments and semaphores for those users.
As Oracle ASM and Databases use shared memory segments for SGA, removing shared memory segments will crash the Oracle ASM and database instances. (MOS 文档 ID 2081410.1)


2、环境介绍

      操作系统:CentOS Linux release 7.2.1511(Core)

      数据库:Oracle 11.2.0.1.0单实例环境,未打补丁


3、解决方案

      根据MOS文档(2081410.1)解释如下

The semaphores used by Oracle have been inadvertendly removed
The errors are signalling that something happened at the OS level with shared memory and/or semaphores. The semaphore sets could be removed manually, or they could be dying for some reason due to a hardware error.
Either when remounting the /dev/shm or You may want to check for any possibility of a user dba using the "ipcrm" command to kill the semaphores (accidentally) since the error ora-27301 (OS failure message: Identifier removed) suggests that. Also, it could have been a bad memory stick or something else at the OS level. Someone could also have removed the shared memory segments at the OS level for some specific reason, or by accident. Most likely something had removed the shared memory and semaphore sets in use by 'oracle'. This can only be done by a root-level user or 'oracle' itself who owns the resources. If someone logged in as root and removed all IPC resources, Oracle would crash when it lost the allocated shared memory/semaphores.


solution:



This could be due to some outside user or application removing the semaphores/shared memory.
To monitor the semaphore/shared memory state we can use the following methods:

Setup a cronjob to run every 5-10min and dump the output of 'ipcs' and 'ps - ef' to a file with a timestamp.
Rotate your logs every 4-7 days to build a history.
Then if the problem re-occurs, we can at least try to make sure 'ipcrm' wasn't the culprit and get some general information of the state of the IPC resources plus the processes running.

You can also consult with your sysadmin to check if there is any OS level auditing that can be turned on to audit the usage of commands like 'ipcrm' which can remove shared memory segments /semaphore sets.

Note: This issue can happen on different platform, but in case you encounter the issue in RHEL7.2, then please also check below RHEL7.2 specific information.

In RHEL7.2 operating system setting RemoveIPC=YES crashes the database.The default value for RemoveIPC in RHEL7.2 is YES.

Workaround :

1) Set RemoveIPC=no in /etc/systemd/logind.conf if it is not in that file

2) Reboot the server or restart systemd-logind as follows:
# systemctl daemon-reload
# systemctl restart systemd-logind

OR

Migrate to Oracle Linux 7.2 resolves the problem.

    

 

4、操作命令

4.1、在/etc/systemd/logind.conf修改RemoveIPC


      RemoveIPC=no

4.2、重启服务


      systemctl daemon-reload

      systemctl restart systemd-logind
最后修改时间:2021-12-15 18:22:05
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论