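Author: bytehouse
Oracle ACE, PostgreSQL ACE
10+ years of hands-on experience in database architecture and operations
WeChat official account: bytehouse
Modb (墨天轮) column: bytehouse
CSDN: Young DBA
Summary: A sudden outage prompted me to take a closer look at problems around the password file of the Oracle ASM instance.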
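Fault overview
When the ASM password file orapwasm is lost, accidentally deleted, or overwritten and corrupted, cluster authentication fails after CRS is restarted, with the following typical symptoms:
- ASM on one node cannot start automatically with the cluster, and the ora.storage resource stays stuck in the STARTING state.
- The cluster alert log and the GI background trace logs keep reporting: ORA-01017: invalid username/password; logon denied
- On the failed node, crsctl stat res -t fails with: CRS-4535: Cannot communicate with Cluster Ready Services
- Running startup manually via sqlplus / as sysasm forces the ASM instance up and the cluster recovers temporarily, but the fault reappears on every subsequent cluster restart.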
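Experiment 1: repairing credentials with asmcmd --nocp credfix
Test environment
- GI version: 19.30.0.0.0
- Preconditions: the ASM password file physically exists (only corrupted or with wrong permissions, not fully deleted), there is no prior backup, and passwordless SSH trust for the root user is configured between the two nodes
- Applicable scenarios: the password file was overwritten or corrupted; the cluster credentials and the ASM password no longer match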
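Step 1: users in the password file while the cluster is healthy
[grid@rac1 ~]$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE   TRUE
CRSUSER__ASM_004   TRUE   FALSE   TRUE
         ASMSNMP   TRUE   FALSE  FALSE
      ORACLE_148   TRUE   FALSE  FALSE
All cluster resources are ONLINE. ASM, disk groups, databases, VIPs, and SCAN listeners on both nodes start automatically with CRS, with no authentication errors.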
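Step 2: simulate ASM password file corruption
Overwrite the existing password file directly, which strips the SYS administrator of its key privilege
[grid@rac1 ~]$ orapwd file='+dg_ocr/orapwasm' asm=y force=y password=Password123*
Check the privileges after the overwrite: SYS has lost the sysasm privilege (the defining signature of this fault)
[grid@rac1 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE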
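The privilege check above can also be scripted. What follows is a minimal sketch, assuming the `asmcmd lspwusr` output consists of a header row starting with `Username` and one row per user, as in the listings in this article. The function name `pw_priv` is hypothetical, not part of any Oracle tool.

```shell
#!/bin/sh
# pw_priv USER PRIV: read `asmcmd lspwusr` output on stdin and print the
# TRUE/FALSE value of PRIV (sysdba, sysoper or sysasm) for USER.
# Hypothetical helper; assumes a header row followed by one row per user.
pw_priv() {
  awk -v user="$1" -v priv="$2" '
    $1 == "Username" {              # header row: locate the privilege column
      for (i = 1; i <= NF; i++) if (tolower($i) == tolower(priv)) col = i
      next
    }
    $1 == user && col { print $col; found = 1 }
    END { exit found ? 0 : 1 }      # non-zero if the user or column is missing
  '
}

# Sample taken from the listing captured after the overwrite in step 2.
sample='Username sysdba sysoper sysasm
SYS TRUE TRUE FALSE'

printf '%s\n' "$sample" | pw_priv SYS sysasm   # prints FALSE
```

On a live node the same check would be `asmcmd lspwusr | pw_priv SYS sysasm`, run as the grid user.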
Step 3: restart CRS on the nodes
Run on each of the two nodes
# Stop the cluster stack
crsctl stop crs
# Start the cluster stack
crsctl start crs
Fault symptoms
- Cluster resource query on rac1: ASM is up on one node, while all resources on the other node are offline
[grid@rac1 ~]$ crsctl stat res -t
ora.asm(ora.asmgroup)
1 ONLINE ONLINE rac1 Started,STABLE
2 ONLINE OFFLINE STABLE
- Upper-stack resource query on rac2: cannot connect to the CRS service
[grid@rac2 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
- Lower-stack (init) resource query on rac2: crsd is offline and the storage resource is stuck while starting
[grid@rac2 ~]$ crsctl stat res -t -init
ora.crsd
1 ONLINE OFFLINE STABLE
ora.storage
1 ONLINE OFFLINE rac2 STARTING
- Background logs: the GI trace logs repeatedly report
ORA-01017: invalid username/password; logon denied
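The stuck-resource symptom can be detected from the `crsctl stat res -t -init` output. The sketch below assumes the layout shown in the abridged listing above: each resource name (`ora.*`) on its own line, followed by its instance rows. The helper name `stuck_resources` is hypothetical.

```shell
#!/bin/sh
# stuck_resources: read `crsctl stat res -t -init` output on stdin and print
# the name of every resource that has an instance row in the STARTING state.
# Hypothetical helper; assumes each resource name is on its own line,
# followed by its instance rows.
stuck_resources() {
  awk '
    /^ora\./ { res = $1; next }     # remember the resource being listed
    res != "" {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^STARTING/ && !seen[res]++) print res
    }
  '
}

# Sample taken from the rac2 listing in this article.
sample='ora.crsd
1 ONLINE OFFLINE STABLE
ora.storage
1 ONLINE OFFLINE rac2 STARTING'

printf '%s\n' "$sample" | stuck_resources   # prints ora.storage
```

An empty result means no resource is stuck in STARTING. It does not by itself prove the stack is healthy, since crsd can still be OFFLINE.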
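Step 4: bring up the ASM instance on the failed node manually
[grid@rac2 ~]$ sqlplus / as sysasm
SQL> startup
ASM instance started
Total System Global Area 1137173320 bytes
Fixed Size                  8905544 bytes
Variable Size            1103101952 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
Note: after the manual startup the cluster recovers temporarily, but the underlying OCR/OLR credentials are still wrong, so the fault reappears when the cluster is restarted.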
Step 5: check the state of the cluster credentials
[grid@rac1 ~]$ asmcmd --nocp credverify
credverify: No credentials in password file, please run 'credfix' to fix the credentials.
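Step 6: configure passwordless SSH between the nodes
Configure passwordless SSH trust for the root user between the two nodes (this must be done first; otherwise credfix fails with KFOD-00610)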
[root@rac1 ~]$ /u01/app/19.3.0/grid/oui/prov/resources/scripts/sshUserSetup.sh -user root -hosts "rac1 rac2 " -advanced -noPromptPassphrase
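Step 7: run credfix as root to repair the OCR/OLR cluster credentials automatically
Running this command as the grid user fails immediately with KFOD-00610; it must be run as root
[root@rac1 ~]$ asmcmd --nocp credfix
# Complete, unabridged output from the test run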
credfix: Credentials for CRSUSER__ASM_004 not in password file, trying next credential.
op=addcrscreds wrap=/tmp/creds0.xml
op=credstoxml wrap=/tmp/new_creds.xml
op=credimport wrap=/tmp/new_creds.xml olr=true force=true
credfix: OLR for rac1 has been fixed if credentials were created incorrectly.
credfix: Starting SSH session on node rac2.
credfix: OLR for rac2 has been fixed if credentials were created incorrectly. Exiting SSH session.
op=delcrscreds crs_user=CRSUSER__ASM_004
credfix: Deleted CRSUSER__ASM_004 from OCR.
credverify: Credentials created correctly on rac1.
credverify: Starting SSH session on node rac2
credverify: Credentials created correctly on rac2. Exiting SSH session.
credfix: Credentials have been fixed if they were created incorrectly.
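Repair notes: credfix automatically deletes the stale cluster authentication account CRSUSER__ASM_004, creates a new account CRSUSER__ASM_005, and repairs the OLR (local registry) on both nodes together with the OCR (cluster registry).
The numeric suffix of the CRSUSER__ASM_00X account increases by 1 on each run.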
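Step 8: restore the monitoring and application users and their privileges
(credfix does not recreate these users automatically)
# Give SYS back the missing sysasm privilege
asmcmd orapwusr --grant sysasm SYS
# Recreate the monitoring user ASMSNMP and grant its privilege
asmcmd orapwusr --add ASMSNMP
asmcmd orapwusr --grant sysdba ASMSNMP
# Recreate the application user ORACLE_148 and grant its privilege
asmcmd orapwusr --add ORACLE_148
asmcmd orapwusr --grant sysdba ORACLE_148
Verify the privileges and confirm they match the initial state:
[grid@rac1 ~]$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE   TRUE
CRSUSER__ASM_005   TRUE   FALSE   TRUE
         ASMSNMP   TRUE   FALSE  FALSE
      ORACLE_148   TRUE   FALSE  FALSE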
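To decide which users have to be recreated in this step, a saved `asmcmd lspwusr` listing from before the fault can be compared with the current one. This is a minimal sketch under that assumption. The function name `missing_pw_users` is hypothetical, and the "after" sample below is illustrative (the article does not show the listing between credfix and the manual grants).

```shell
#!/bin/sh
# missing_pw_users BASELINE CURRENT: print users present in the BASELINE
# `asmcmd lspwusr` capture but absent from the CURRENT one. CRSUSER__ASM_*
# accounts are skipped because credfix replaces them under a new name.
# Hypothetical helper; BASELINE and CURRENT must be two different files.
missing_pw_users() {
  awk '
    $1 == "Username" || NF == 0 { next }      # skip header and blank lines
    $1 ~ /^CRSUSER__ASM_/       { next }
    FILENAME == ARGV[1] {
      if (!($1 in base)) { base[$1] = 1; order[++n] = $1 }
      next
    }
    { cur[$1] = 1 }
    END { for (i = 1; i <= n; i++) if (!(order[i] in cur)) print order[i] }
  ' "$1" "$2"
}

tmp=$(mktemp -d)
cat > "$tmp/before.txt" <<'EOF'
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
CRSUSER__ASM_004 TRUE FALSE TRUE
ASMSNMP TRUE FALSE FALSE
ORACLE_148 TRUE FALSE FALSE
EOF
# Illustrative state after credfix, before the manual grants.
cat > "$tmp/after.txt" <<'EOF'
Username sysdba sysoper sysasm
SYS TRUE TRUE FALSE
CRSUSER__ASM_005 TRUE FALSE TRUE
EOF

missing_pw_users "$tmp/before.txt" "$tmp/after.txt"   # prints ASMSNMP, then ORACLE_148
rm -rf "$tmp"
```

The sketch only compares user names. A privilege that was lost on a surviving user, such as sysasm on SYS, still has to be checked separately.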
Step 9: restart CRS on the nodes to verify the repair
crsctl stop crs
crsctl start crs
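Result verification
- Cluster resources: crsctl stat res -t shows every resource on both nodes ONLINE after an automatic start, with none offline or in an intermediate state.
- Credentials:
[grid@rac1 ~]$ asmcmd --nocp credverify
credverify: Credentials created correctly on rac1.
credverify: Starting SSH session on node rac2
credverify: Credentials created correctly on rac2. Exiting SSH session.
No errors are reported. The credentials tied to the ASM password file are fully repaired, and the fault does not recur after a restart.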
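The credverify result can be checked per node in a script. The sketch below assumes the message text shown in the output above ("Credentials created correctly on NODE."). The helper name `credverify_ok` is hypothetical.

```shell
#!/bin/sh
# credverify_ok NODE...: read `asmcmd --nocp credverify` output on stdin and
# succeed only if every listed node has a line reading
# "Credentials created correctly on NODE.".
# Hypothetical helper; the message text is copied from the output above.
credverify_ok() {
  out=$(cat)
  for node in "$@"; do
    printf '%s\n' "$out" |
      grep -qF "Credentials created correctly on ${node}." || return 1
  done
}

# Sample taken from the verification output in this article.
sample='credverify: Credentials created correctly on rac1.
credverify: Starting SSH session on node rac2
credverify: Credentials created correctly on rac2. Exiting SSH session.'

if printf '%s\n' "$sample" | credverify_ok rac1 rac2; then
  echo "credentials OK on all nodes"
fi
```

The trailing dot in the fixed string keeps a node named rac1 from matching a line for a node such as rac10.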
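Experiment 2: restoring the password file from a backup
Step 1: prerequisite
Back up the password file from the ASM disk group to the local file system
[grid@rac1 ~]$ mkdir -p /home/grid/backup
[grid@rac1 ~]$ asmcmd pwcopy +DG_OCR/orapwasm /home/grid/backup/orapwasm.bak
# Check that the backup file exists and has a plausible size
[grid@rac1 ~]$ ll /home/grid/backup/orapwasm.bak
-rw-r----- 1 grid oinstall 21504 Dec 2 15:16 orapwasm.bak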
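Because this approach depends entirely on having a backup, the copy is worth scripting. The following is a minimal sketch that wraps the same `asmcmd pwcopy` command in a function, adds a timestamp to the file name, and fails if the copy is missing or empty. The function name `backup_asm_pwfile` and the file naming scheme are assumptions for illustration.

```shell
#!/bin/sh
# backup_asm_pwfile SRC DIR: copy the ASM password file SRC into DIR under a
# timestamped name with `asmcmd pwcopy` and print the path of the copy.
# Returns non-zero if the copy fails or the resulting file is empty.
# Hypothetical helper; the naming scheme is an assumption.
backup_asm_pwfile() {
  src=$1
  dir=$2
  dest="$dir/orapwasm.$(date +%Y%m%d%H%M%S).bak"
  mkdir -p "$dir" || return 1
  # Send asmcmd's own messages to stderr so stdout carries only the path.
  asmcmd pwcopy "$src" "$dest" >&2 || return 1
  [ -s "$dest" ] || return 1
  printf '%s\n' "$dest"
}

# Example (run as grid on a cluster node):
#   backup_asm_pwfile +DG_OCR/orapwasm /home/grid/backup
```

Running this from cron on one node would keep dated copies. How many to retain is left to the site's backup policy.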
Step 2: delete the password file from the ASM disk group
[grid@rac1 ~]$ asmcmd
ASMCMD> rm -rf +DG_OCR/orapwasm
ASMCMD> exit
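Step 3: restart CRS to reproduce the fault
The symptoms are identical to Experiment 1: crsd is offline on rac2, ora.storage is stuck in STARTING, the logs show ORA-01017, and a manual startup brings ASM up temporarily.
Step 4: restore the local backup to the original ASM path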
[grid@rac1 ~]$ asmcmd
ASMCMD> cp /home/grid/backup/orapwasm.bak +dg_ocr/orapwasm
copying /home/grid/backup/orapwasm.bak -> +dg_ocr/orapwasm
ASMCMD> exit
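Step 5: verify the users and privileges inside the password file
[grid@rac1 ~]$ asmcmd lspwusr
The output is identical to the baseline taken before the fault: all users and privileges are intact, and no additional grants are needed.
Step 6: restart CRS to verify
# Run as root on both nodes
crsctl stop crs
crsctl start crs
All cluster resources start automatically, with no credential repair and no manual ASM startup. Of the four approaches tested here this one recovers fastest, and it is the preferred option for production.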
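Experiment 3: patching GI to obtain credfix, then repairing
Test environment
- Baseline GI version: 19.3.0.0.0 (ships without the credverify/credfix commands)
- Patch: GI version after patching is 19.30.0.0.0
- Preconditions: no backup of the ASM password file; downtime for GI patching is allowed
Step 1: confirm that the older version lacks the repair commands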
[grid@rac1 ~]$ asmcmd --nocp credverify
ASMCMD-8022: unknown command 'credverify' specified
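Step 2: simulate the fault and apply the temporary workaround
- Delete the ASM password file: asmcmd rm -rf +DG_OCR/orapwasm
- As root, restart CRS to reproduce the ORA-01017 fault
- As grid, run startup manually to bring up the ASM instances on both nodes
Step 3: apply the GI patch offline on node 1
Stop the database first to reduce patch conflicts
[grid@rac1 ~]$ srvctl stop database -d orcl
[root@rac1 ~]$ /u01/app/19.3.0/grid/OPatch/opatchauto apply /soft/34130714 -oh /u01/app/19.3.0/grid
Step 4: resume the interrupted patch session
/u01/app/19.3.0/grid/OPatch/opatchauto resume
GI is now at version 19.30, and the credverify and credfix commands are available.
Step 5: repeat the patch and resume operations on node 2
Step 6: restore the ASM user privileges
asmcmd orapwusr --grant sysasm SYS
asmcmd orapwusr --add ASMSNMP
asmcmd orapwusr --grant sysdba ASMSNMP
Step 7: set up root SSH trust, then run credfix to repair the credentials (same commands as in Experiment 1)
Step 8: run resume once more on both nodes to complete the patch installation
Step 9: verify that the cluster and the workload have recovered
crsctl stop crs
crsctl start crs
srvctl start database -d orcl
The cluster, database, and listener resources all come online normally, and the fault is fully resolved.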
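Experiment 4: no backup and no option to patch, rebuilding the ASM password file by hand (last-resort emergency option)
4.1 Test environment
- GI version: 19.3.0.0.0, without the credfix command
- Constraints: no backup of the password file at all; the workload runs 24x7, so GI cannot be taken down for patching
- Principle: dump the OCR (cluster registry), extract the plaintext password of the cluster's built-in account, and recreate all users and privileges by hand
4.2 Step 1: create the fault and apply the temporary workaround
- As grid, delete the password file: asmcmd rm -rf +DG_OCR/orapwasm
- As root, restart CRS to reproduce the authentication fault
- As grid, start the ASM instances on both nodes manually with sqlplus
4.3 Step 2: create a new, empty ASM password file
orapwd file='+dg_ocr/orapwasm' asm=y force=y password=Password123*
4.4 Step 3: dump the OCR and extract the key of the cluster account
Run as: grid@rac1
/u01/app/19.3.0/grid/bin/ocrdump /tmp/ocr.dmp
vi /tmp/ocr.dmp
Search for the keyword CRSUSER__ASM to find the 32-character key string:
[SYSTEM.ASM.CREDENTIALS,USERS.CRSUSER__ASM_007 ORATEXT : 94fe808e905aef90bf335f143d8fa6f5]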
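Instead of searching the dump in vi, the key can be pulled out with grep. This is a minimal sketch, assuming the key is a 32-character lowercase hexadecimal string that appears on the line mentioning CRSUSER or on the line directly after it. The function name `crsuser_key` is hypothetical, and it relies on the `-A` and `-o` options of GNU grep.

```shell
#!/bin/sh
# crsuser_key [FILE]: print the first 32-character hexadecimal key found on,
# or on the line after, a line mentioning CRSUSER in ocrdump text output.
# Reads stdin when no FILE is given.
# Hypothetical helper; requires grep with -A and -o (for example GNU grep).
crsuser_key() {
  grep -A1 'CRSUSER' "$@" | grep -oE '[0-9a-f]{32}' | head -n 1
}

# Sample line as shown in this article.
sample='[SYSTEM.ASM.CREDENTIALS,USERS.CRSUSER__ASM_007 ORATEXT : 94fe808e905aef90bf335f143d8fa6f5]'

printf '%s\n' "$sample" | crsuser_key   # prints 94fe808e905aef90bf335f143d8fa6f5
```

Against the real dump the call would be `crsuser_key /tmp/ocr.dmp`. If the dump contains more than one CRSUSER entry, only the first key is printed, so the entry should be confirmed by eye before using it.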
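4.5 Step 4: use the key to look up the plaintext password of the CRSUSER account
crsctl get credmaint -path /ASM/Self/94fe808e905aef90bf335f143d8fa6f5 -credtype userpass -id 0 -attr passwd -local
# Plaintext password returned (actual result from this test)
q19CaqwsMFZV1rW2kx3bTRWuazidD
4.6 Step 5: create all users by hand and grant exactly the required privileges
# Give SYS back its key privilege
asmcmd orapwusr --grant sysasm SYS
# Add the monitoring user
asmcmd orapwusr --add ASMSNMP
asmcmd orapwusr --grant sysdba ASMSNMP
# Add the internal cluster authentication account, entering the plaintext password obtained from the OCR
asmcmd orapwusr --add CRSUSER__ASM_007
asmcmd orapwusr --grant sysdba CRSUSER__ASM_007
asmcmd orapwusr --grant sysasm CRSUSER__ASM_007
4.7 Step 6: verify the privileges
[grid@rac1 ~]$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE   TRUE
         ASMSNMP   TRUE   FALSE  FALSE
CRSUSER__ASM_007   TRUE   FALSE   TRUE
4.8 Step 7: restart CRS on all nodes to verify
crsctl stop crs
crsctl start crs
All cluster resources come online automatically with no manual ASM startup, and the fault is permanently fixed.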
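The preconditions of the four experiments can be condensed into a small decision helper. This is only a sketch of how the options relate, based on the conditions stated for each experiment in this article. The function name `choose_recovery` and its yes/no arguments are hypothetical.

```shell
#!/bin/sh
# choose_recovery HAS_BACKUP HAS_CREDFIX CAN_PATCH: each argument is yes or no.
# Prints which experiment from this article applies:
#   HAS_BACKUP  - a copy of the ASM password file exists
#   HAS_CREDFIX - the installed GI version provides asmcmd credfix
#   CAN_PATCH   - downtime for GI patching is allowed
# Hypothetical helper summarising the preconditions of the four experiments.
choose_recovery() {
  if [ "$1" = yes ]; then
    echo "Experiment 2: restore the password file from backup"
  elif [ "$2" = yes ]; then
    echo "Experiment 1: asmcmd --nocp credfix"
  elif [ "$3" = yes ]; then
    echo "Experiment 3: patch GI, then credfix"
  else
    echo "Experiment 4: rebuild the password file by hand"
  fi
}

choose_recovery no no no   # prints Experiment 4: rebuild the password file by hand
```

The backup check comes first because the restore in Experiment 2 needs no credential repair at all.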




