2

诡异-ORA-12547: TNS:lost contact导致数据库无法启动

许玉晨 2020-06-28
1520
摘要:摘要:一个诡异的案例:ORA-12547: TNS:lost contact导致数据库无法启动,甚至sqlplus都无法登录,让我们一一来解开这个案例的真面目

1.背景概述

某客户出现数据库无法启动的情况,申请云和恩墨协助分析和处置。云和恩墨工程师快速响应,组织相关人员进行故障诊断分析、指出故障原因,提出解决措施并处置,快速恢复了业务。

下面是详细的故障分析诊断过程,以及详细的解决方案描述。

2.故障分析

2.1. 故障现象

数据库无法启动,数据库监听状态异常。

Thu Apr 30 15:40:20 2020 NOTE: ASMB terminating Errors in file /oracle/app/oracle/diag/rdbms/****/****/trace/****_asmb_8258020.trc: ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel Process ID: Session ID: 595 Serial number: 9 Errors in file /oracle/app/oracle/diag/rdbms/****/****/trace/****_asmb_8258020.trc: ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel Process ID: Session ID: 595 Serial number: 9 ASMB (ospid: 8258020): terminating the instance due to error 15064 Thu Apr 30 15:40:20 2020 System state dump requested by (instance=1, osid=8258020 (ASMB)), summary=[abnormal instance termination]. System State dumped to trace file /oracle/app/oracle/diag/rdbms/****/****/trace/****_diag_8389092_20200430154020.trc Dumping diagnostic data in directory=[cdmp_20200430154020], requested by (instance=1, osid=8258020 (ASMB)), summary=[abnormal instance termination]. Instance terminated by ASMB, pid = 8258020

发现数据库的asm实例也出现异常

2.2. 故障根源

去分析asm的alert日志

SQL> ALTER DISKGROUP DATA DISMOUNT /* asm agent *//* {0:0:49022} */ Thu Apr 30 15:40:19 2020 Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_gmon_7405636.trc: ORA-29746: Cluster Synchronization Service is being shut down. GMON (ospid: 7405636): terminating the instance due to error 29746 Thu Apr 30 15:40:20 2020 System state dump requested by (instance=1, osid=7405636 (GMON)), summary=[abnormal instance termination]. System State dumped to trace file /oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_diag_7406038_20200430154020.trc Dumping diagnostic data in directory=[cdmp_20200430154020], requested by (instance=1, osid=7405636 (GMON)), summary=[abnormal instance termination]. Instance terminated by GMON, pid = 7405636

ASM实例异常。
image.png
Ora.asm的资源是offline状态。

尝试关闭has,然后重新启动has,再看看asm是否可以启动

#/oracle/grid/bin/crsctl start has Could not send msg exec /oracle/grid/perl/bin/perl -I/oracle/grid/perl/lib /oracle/grid/bin/crswrapexece.pl /oracle/grid/crs/install/s_crsconfig_***_env.txt /oracle/grid/bin/ohasd.bin "reboot" . Please retry 2020-05-02 11:16:25 Changing directory to /oracle/grid/log/***/ohasd

has启动失败,查看asm agent日志

2020-05-02 11:18:57.898: [ora.asm][3343]{0:0:2} [clean] InstAgent::stop: connect2 oracleHome /oracle/grid oracleSid +ASM 2020-05-02 11:18:57.898: [ora.asm][3343]{0:0:2} [clean] InstConnection::connectInt: server not attached 2020-05-02 11:18:57.936: [ora.asm][3343]{0:0:2} [clean] ORA-12547: TNS:lost contact 2020-05-02 11:18:57.936: [ora.asm][3343]{0:0:2} [clean] InstConnection::connectInt (1) Exception OCIException 2020-05-02 11:18:57.936: [ora.asm][3343]{0:0:2} [clean] InstConnection:connect:excp OCIException OCI error 12547 2020-05-02 11:18:57.937: [ora.asm][3343]{0:0:2} [clean] InstConnection::connectInt: server not attached 2020-05-02 11:18:57.975: [ora.asm][3343]{0:0:2} [clean] ORA-12547: TNS:lost contact 2020-05-02 11:18:57.975: [ora.asm][3343]{0:0:2} [clean] InstConnection::connectInt (1) Exception OCIException 2020-05-02 11:18:57.975: [ora.asm][3343]{0:0:2} [clean] InstAgent::stop: connect2 errcode 12547 2020-05-02 11:18:57.976: [ora.asm][3343]{0:0:2} [clean] clsnUtils::error Exception type=2 string= ORA-12547: TNS:lost contact 2020-05-02 11:18:57.976: [ AGFW][3343]{0:0:2} sending status msg [ORA-12547: TNS:lost contact ] for clean for resource: ora.asm **** 1 2020-05-02 11:18:57.976: [ora.asm][3343]{0:0:2} [clean] ConnectionPool::removeConnection connection count 1 2020-05-02 11:18:57.976: [ora.asm][3343]{0:0:2} [clean] ConnectionPool::removeConnection sid +ASM, InstConnection 11471d30 2020-05-02 11:18:57.976: [ USRTHRD][3343]{0:0:2} InstConnection::breakCall pConnxn:11471d30 DetachLock:1059c2f0 m_pSvcH:00000000 2020-05-02 11:18:57.976: [ USRTHRD][3343]{0:0:2} InstConnection:~InstConnection: this 11471d30

Asm启动过程中报InstConnection:connect:excp OCIException OCI error 12547之后启动失败。

我们发现sqlplus / as sysdba登录也会出现TNS 12547的报错
通过truss 去跟踪sqlplus
image.png

发现在读写sqlnet.log 时候报错,怀疑是oracle本身有问题,查询metalink证实了这个想法:Troubleshooting ORA-12547 TNS: Lost Contact [ID 555565.1]。 然后尝试去relink。

image.png

relink的日志出现ksh: /dev/null: 0403-005 Cannot create the specified file.的报错。mos 上/dev/null: 0403-005 Cannot create the specified file" Error While Relinking In AIX (Doc ID 1900029.1)和IBM官方文章:https://www.ibm.com/support/pages/file-access-permissions-do-not-allow-specified-action都有记录此问题的解决方案

image.png
故障时候/dev/null的权限是660,执行chmod 666 /dev/null,sqlplus / as sysdba不再显示ORA-12547: TNS:lost contact的错误。

2.3. 故障处置

尝试重新启动has,ASM实例正常启动,但是数据库实例无法启动。
image.png

手工启动数据库。又出现CRS-5016: Process "/oracle/grid/bin/setasmgidwrap等错误
image.png

020-05-02 13:25:00.251: [ora.****.db][1800]{0:0:2} [start] InstConnection::connectInt (2) Exception OCIException 2020-05-02 13:25:00.251: [ora.****.db][1800]{0:0:2} [start] InstConnection:connect:excp OCIException OCI error 1034 2020-05-02 13:25:00.251: [ora.****.db][1800]{0:0:2} [start] InstAgent::stop: connect1 errcode 1034 2020-05-02 13:25:00.251: [ora.****.db][1800]{0:0:2} [start] InstAgent::stop: connect2 oracleHome /oracle/app/oracle/product/11.2.0/dbhome_1 oracleSid **** 2020-05-02 13:25:00.251: [ora.****.db][1800]{0:0:2} [start] InstConnection::connectInt: server not attached 2020-05-02 13:25:00.319: [ora.****.db][1800]{0:0:2} [start] ORA-01017: invalid username/password; logon denied 2020-05-02 13:25:00.319: [ora.****.db][1800]{0:0:2} [start] InstConnection::connectInt (2) Exception OCIException 2020-05-02 13:25:00.319: [ora.****.db][1800]{0:0:2} [start] InstConnection:connect:excp OCIException OCI error 1017 2020-05-02 13:25:00.319: [ora.****.db][1800]{0:0:2} [start] InstAgent::stop: connect2 errcode 1017 2020-05-02 13:25:00.319: [ora.****.db][1800]{0:0:2} [start] clsnUtils::error Exception type=2 string= ORA-01017: invalid username/password; logon denied 2020-05-02 13:25:00.319: [ AGFW][1800]{0:0:2} sending status msg [ORA-01017: invalid username/password; logon denied ] for start for resource: ora.****.db 1 1

看来问题是ORA-01017: invalid username/password; logon denied
手工执行sqlplus / as sysdba也出现这样的报错
image.png

根据oracle官方文章ORA-01017-Liunx-connecting using slash(sqlplus /as sysdba)中的描述
image.png
解决办法:

执行sqlplus sys as sysdba登录。 执行startup,手动启动数据库

3.根本解决方案及建议

有人问了,为什么/dev/null的权限会发生变化?到底为什么?并且客户反馈修改成666之后不到一分钟/dev/null的权限又再次变成660。

进过排查,IBM文章Oracle agent (rz) changes /dev/null permission from 666 to 660得到了答案,比如itm监控oralce的agent,未设置TNS_ADMIN变量,则代理会将Oracle客户端日志文件重定向到/dev/null。当代理以root用户身份运行时,Oracle API调用会导致/dev/null的权限从666更改为660。
image.png

为了避免/dev/null的权限从666更改为660有如下几种方法:

1、可以以非root用户身份安装,配置和启动代理。 2、在/opt/IBM/ITM/config/rz_ORCL.config文件中,添加以下设置 TNS_ADMIN='/opt/oracle/product/11.2.0/dbhome_1/network/admin' 3、如果要使所有代理程序具有相同的TNS_ADMIN,编辑rz.ini以添加TNS_ADMIN设置,例如: TNS_ADMIN=/opt/oracle/product/11.2.0/dbhome_1/network/admin

临时解决办法:

chmod 666 /dev/null

image.png

「喜欢文章,快来给作者赞赏墨值吧」
1人已赞赏

评论

3
2
最新发布
暂无内容,敬请期待...
数据库资讯
最新 热门 更多
本月热门
近期活动
全部
暂无活动,敬请期待...
相关课程
全部
暂无课程,敬请期待...