Abnormal Restarts of the Recovery Node on an ADG Standby Database

IT那活儿 2024-04-09


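Problem Description

Over the past month, database instances on some nodes of a core database (an ADG standby) have repeatedly restarted abnormally.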

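Analysis and Diagnosis

2.1 Check the database logs

Every abnormal restart occurred on the Media Recovery node, and the log entries were essentially identical each time.

In each case ORA-04031 was raised, the LMS process was terminated, and the instance crashed as a result.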

2023-12-30T00:34:27.339420+08:00
Errors in file /u01/app/oracle/diag/rdbms/XXX_a/XXX5/trace/XXX5_lmse_112014_112041.trc:
ORA-04031: unable to allocate 168 bytes of shared memory ("shared pool","unknown object","sga heap(2,0)","gcs dynamic shadows lms")
2023-12-30T00:34:28.136580+08:00
opidrv aborting process LMSE ospid (112014_112041) as a result of ORA-4031
2023-12-30T00:34:28.136676+08:00
Errors in file /u01/app/oracle/diag/rdbms/XXX_a/XXX5/trace/XXX5_lmse_112014_112041.trc:
ORA-04031: unable to allocate 168 bytes of shared memory ("shared pool","unknown object","sga heap(2,0)","gcs dynamic shadows lms")
2023-12-30T00:34:28.157891+08:00
System state dump requested by (instance=5, osid=111863 (PMON)), summary=[abnormal instance termination]. error - 'Instance is terminating.
'
System State dumped to trace file /u01/app/oracle/diag/rdbms/XXX_a/XXX5/trace/XXX5_diag_111930.trc
2023-12-30T00:34:28.180572+08:00
PMON (ospid: 111863): terminating the instance due to ORA error 12752
2023-12-30T00:34:28.180698+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 52, ospid: 112014_112041, LMSE)'
2023-12-30T00:34:29.028197+08:00
ORA-1092 : opitsk aborting process
2023-12-30T00:34:29.290269+08:00
Non critical error ORA-48913 caught while writing to trace file "/u01/app/oracle/diag/rdbms/XXX_a/XXX5/trace/XXX5_diag_111930.trc"
Error message: ORA-48913: Writing into trace file failed, file size limit [10485760] reached
Writing to the above trace file is disabled for now...
2023-12-30T00:34:29.554757+08:00
ORA-1092 : opitsk aborting process
2023-12-30T00:34:30.227296+08:00
License high water mark = 143
2023-12-30T00:34:34.205214+08:00
Instance terminated by PMON, pid = 111863
2023-12-30T00:34:34.763673+08:00
Warning: 2 processes are still attacheded to shmid 341147706:
(size: 94208 bytes, creator pid: 111091, last attach/detach pid: 111927)
2023-12-30T00:34:35.228955+08:00
USER(prelim) (ospid: 294282): terminating the instance
2023-12-30T00:34:35.231483+08:00
Instance terminated by USER(prelim), pid = 294282
2023-12-30T00:34:37.267130+08:00
Starting ORACLE instance (normal) (OS id: 294729)
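
When the same failure recurs across several restarts, it helps to pull the ORA-04031 details out of the alert log mechanically rather than by eye. The following is a minimal sketch, not a tool used in the original investigation: the function name `find_ora_04031` and the dictionary keys are illustrative assumptions, and the regular expression only targets the message layout shown in the excerpt above.

```python
import re

# Matches the ORA-04031 message layout seen in the alert log excerpt above.
ORA_04031 = re.compile(
    r'ORA-04031: unable to allocate (\d+) bytes of shared memory '
    r'\("([^"]*)","([^"]*)","([^"]*)","([^"]*)"\)'
)
# Alert log timestamp lines, e.g. 2023-12-30T00:34:27.339420+08:00
TIMESTAMP = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$')


def find_ora_04031(lines):
    """Return one dict per ORA-04031 line, tagged with the preceding timestamp."""
    events = []
    last_ts = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            last_ts = line
            continue
        m = ORA_04031.search(line)
        if m:
            size, pool, _obj, heap, component = m.groups()
            events.append({
                "timestamp": last_ts,
                "bytes": int(size),
                "pool": pool,
                "heap": heap,
                "component": component,
            })
    return events


# Two lines taken from the excerpt above.
sample = [
    "2023-12-30T00:34:27.339420+08:00",
    'ORA-04031: unable to allocate 168 bytes of shared memory '
    '("shared pool","unknown object","sga heap(2,0)","gcs dynamic shadows lms")',
]
events = find_ora_04031(sample)
for e in events:
    print(e["timestamp"], e["heap"], e["component"], e["bytes"])
```

Grouping the resulting events by `heap` and `component` across all restarts would show whether the failing allocation is always the same one, as the log excerpt suggests.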

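2.2 Examine the incident file further

The "gcs dynamic shadows lms" component in the affected subpool was using an abnormal amount of memory: far more than any other component, and more even than on the heavily loaded primary database.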

==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 2
----------------------------------------------
"gcs dynamic shadows lms "  8331 MB 51%
"free memory "  4126 MB 25%
"gcs resources "  1010 MB 6%
"gcs dynamic resources "   893 MB 5%
"gcs shadows "   551 MB 3%
"gcs resv res hash bucket "   356 MB 2%
"gcs dynamic resources for "   326 MB 2%
"db_block_hash_buckets "   144 MB 1%
"gc name table "   128 MB 1%
"file queue buckets "    86 MB 1%
-----------------------------------------
free memory 4126 MB
memory alloc. 12 GB
Sub total 16 GB
==============================================
TOP 10 MAXIMUM MEMORY USES FOR SGA HEAP SUB POOL 2
----------------------------------------------
"gcs dynamic shadows lms "  8331 MB
"free memory "  4484 MB
"gcs resources "  1010 MB
"gcs dynamic resources "   893 MB
"gcs shadows "   551 MB
"gcs resv res hash bucket "   356 MB
"gcs dynamic resources for "   326 MB
"db_block_hash_buckets "   144 MB
"gc name table "   128 MB
"file queue buckets "    86 MB
==============================================
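
The percentages in the first listing can be reproduced from the MB figures, which is a quick way to confirm that a single component really holds about half of the subpool. This is a minimal sketch: the helper name `parse_rows` is an illustrative assumption, and the subpool total is taken as 16 GB from the "Sub total" line, which is itself rounded, so the shares are approximate.

```python
import re

# First five rows copied from the "TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 2" listing.
LISTING = '''
"gcs dynamic shadows lms "  8331 MB 51%
"free memory "  4126 MB 25%
"gcs resources "  1010 MB 6%
"gcs dynamic resources "   893 MB 5%
"gcs shadows "   551 MB 3%
'''

ROW = re.compile(r'^"([^"]+)"\s+(\d+)\s+MB')


def parse_rows(text):
    """Map component name -> size in MB for each row of the listing."""
    rows = {}
    for line in text.strip().splitlines():
        m = ROW.match(line.strip())
        if m:
            rows[m.group(1).strip()] = int(m.group(2))
    return rows


# Assumption: "Sub total 16 GB" means 16 * 1024 MB (the dump rounds this figure).
SUBPOOL_TOTAL_MB = 16 * 1024

rows = parse_rows(LISTING)
shares = {name: round(mb / SUBPOOL_TOTAL_MB * 100) for name, mb in rows.items()}
for name, pct in shares.items():
    print(f"{name:<28}{rows[name]:>6} MB {pct:>3}%")
```

The computed shares match the percentages printed in the dump, and "gcs dynamic shadows lms" alone is roughly eight times the size of the next largest allocated component.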

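2.3 Check the database's shared pool tuning parameters

These had all been configured according to the standard tuning guidelines, so we suspected an unknown bug and opened an SR to confirm. No exactly matching case was found. Oracle Support's suggested remedy was to try applying a more recent patch, or to restore the relevant hidden parameters to their defaults (that is, enable DRM).

After evaluating the on-site situation together with the business requirements, we found that neither option had a clear justification and both would introduce greater risk, so they were set aside for the time being.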

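2.4 Monitor and analyze further, focusing on GCS-related components and resources

We eventually found that on the RECOVERY instance, gcs_resources and gcs_shadows utilization was far above the initial allocation. This is likely to drive excessive dynamic growth of the related shared pool components and, in turn, fragmentation.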

INST_ID RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
------- ------------- ------------------- --------------- ------------------ -----------
      1 gcs_resources                 650       8,892,902         28,128,315 UNLIMITED
      1 gcs_shadows                   821      10,811,451         28,128,315 UNLIMITED
      2 gcs_resources                 518       8,713,305         28,128,315 UNLIMITED
      2 gcs_shadows                   815      12,490,772         28,128,315 UNLIMITED
      3 gcs_resources                 588       8,842,600         28,128,315 UNLIMITED
      3 gcs_shadows                   878      12,587,568         28,128,315 UNLIMITED
      4 gcs_resources               1,040           1,040         28,128,315 UNLIMITED
      4 gcs_shadows                 1,040           1,040         28,128,315 UNLIMITED
      5 gcs_resources                 585      63,813,068         28,128,315 UNLIMITED
      5 gcs_shadows                   841      80,459,548         28,128,315 UNLIMITED
      6 gcs_resources          52,055,988      55,164,600         28,128,315 UNLIMITED
      6 gcs_shadows            49,333,039      50,507,220         28,128,315 UNLIMITED
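
The comparison described above (peak utilization against the initial allocation) can be made explicit with a few lines of code. This is a minimal sketch using the figures from the table; the function name `over_initial` is an illustrative assumption. The column names match Oracle's GV$RESOURCE_LIMIT view, but the article does not state which query produced the output, so that is an assumption too.

```python
# (inst_id, resource_name, current_utilization, max_utilization, initial_allocation)
# Values copied from the table above.
ROWS = [
    (1, "gcs_resources", 650, 8_892_902, 28_128_315),
    (1, "gcs_shadows", 821, 10_811_451, 28_128_315),
    (2, "gcs_resources", 518, 8_713_305, 28_128_315),
    (2, "gcs_shadows", 815, 12_490_772, 28_128_315),
    (3, "gcs_resources", 588, 8_842_600, 28_128_315),
    (3, "gcs_shadows", 878, 12_587_568, 28_128_315),
    (4, "gcs_resources", 1_040, 1_040, 28_128_315),
    (4, "gcs_shadows", 1_040, 1_040, 28_128_315),
    (5, "gcs_resources", 585, 63_813_068, 28_128_315),
    (5, "gcs_shadows", 841, 80_459_548, 28_128_315),
    (6, "gcs_resources", 52_055_988, 55_164_600, 28_128_315),
    (6, "gcs_shadows", 49_333_039, 50_507_220, 28_128_315),
]


def over_initial(rows):
    """Return (inst_id, resource, max/initial ratio) where the peak exceeded the initial allocation."""
    flagged = []
    for inst, name, _current, max_used, initial in rows:
        if max_used > initial:
            flagged.append((inst, name, max_used / initial))
    return flagged


flagged = over_initial(ROWS)
for inst, name, ratio in flagged:
    print(f"instance {inst} {name}: peak is {ratio:.2f}x the initial allocation")
```

Only instances 5 and 6 are flagged. Instance 5 is the one that crashed in the alert log excerpt, and instance 6 is the only one whose current utilization is still above the initial allocation.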

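Summary and Resolution

3.1 Summary of the problem

As a key RAC background process, LMS is the main implementer of the Global Cache Service (GCS). Because it could not obtain memory from the shared pool in time, the process terminated abnormally and the database instance crashed.

On an Active Data Guard standby, Media Recovery runs on only one node. At month end and the start of the month the primary's load increases, so more redo is shipped to the standby; during the same period the standby also serves data extraction jobs for the history database. Together these most likely push GCS resource usage far above the initial allocation, so the related shared pool components grow significantly at run time and the pool becomes more prone to fragmentation.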

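3.2 Solutions

1) Oracle Support

Try applying a more recent patch, or restore some of the hidden parameters related to DRM to their defaults (higher risk, ruled out for now).

2) On-site approach

Adjust the hidden parameters governing GCS resources on the standby, then monitor whether the situation improves.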

END

Author: 任我行 (上海新炬中北团队)

Source: the "IT那活儿" WeChat official account
