暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

RAC环境使用pfile将一个节点启动到nomount状态遭遇ora-29702

原创 陶皓博 2020-07-24
3277

背景

最近客户这边上了一台X8一体机,需要将原先一套RAC的UAT测试数据库迁移到X8四节点集群上。数据库体量不大,约为700GB,因此采用DG的方式同步。

操作环境

X8上安装有两个数据库版本:19c(19.0.0.7)及11g(11.2.0.4)。19c作为主环境,管理着集群件。

问题

在使用备库pfile进行nomount11g数据库时发生ora-29702错误。告警日志信息及lmon进程的trace文件如下:
alert.ora

Mon Jul 20 11:21:14 2020
Adjusting the default value of parameter parallel_max_servers
from 3600 to 2970 due to the value of parameter processes (3000)
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED

Total Shared Global Region in Large Pages = 20 GB (100%)

Large Pages used by this instance: 10241 (20 GB)
Large Pages unused system wide = 13650 (27 GB)
Large Pages configured system wide = 49987 (98 GB)
Large Page size = 2048 KB


LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 96
Number of processor cores in the system is 48
Number of processor sockets in the system is 2

Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NUMA system with 2 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/product/11.2.0.4/dbhome_1
System name: Linux
Node name: …
Release: 4.14.35-1902.301.1.el7uek.x86_64
Version: #2 SMP Tue Mar 31 16:50:32 PDT 2020
Machine: x86_64
Using parameter settings in client-side pfile /u01/app/oracle/product/11.2.0.4/dbhome_1/dbs/p.ora on machine hzsmtdbadm01.mflex.com.cn
System parameters with non-default values:
processes = 3000
sessions = 4608
sga_max_size = 20G
sga_target = 20G
control_files = “+DATAC1/…x8/controlfile/control01.ctl”
db_file_name_convert = “+DATAC1/…/DATAFILE”
db_file_name_convert = “+DATAC1/…X8/DATAFILE”
log_file_name_convert = “+DATAC1/…/ONLINELOG”
log_file_name_convert = “+DATAC1/…X8/ONLINELOG”
db_block_size = 8192
compatible = “11.2.0.4.0”
log_archive_dest_1 = “location=+DATAC1/…X8”
log_archive_dest_2 = “SERVICE=…DG ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=…DG”
log_archive_dest_3 = “SERVICE=YCWIPUAT ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=…”
log_archive_dest_state_1 = “ENABLE”
log_archive_dest_state_2 = “ENABLE”
log_archive_dest_state_3 = “ENABLE”
fal_client = “…X8”
fal_server = “…,…DG”
log_archive_config = “DG_CONFIG=(…,…DG,…X8)”
log_archive_max_processes= 30
log_buffer = 134184960
cluster_database = TRUE
db_create_file_dest = “+DATAC1”
standby_file_management = “AUTO”
thread = 1
undo_tablespace = “UNDOTBS1”
instance_number = 1
remote_login_passwordfile= “EXCLUSIVE”
audit_sys_operations = TRUE
db_domain = “”
dispatchers = “(PROTOCOL=TCP) (SERVICE=…X8XDB)”
remote_listener = “hzsmt-scan1:1521”
session_cached_cursors = 100
audit_file_dest = “/u01/app/oracle/admin/…X8/adump”
audit_trail = “DB”
db_name = “…”
db_unique_name = “…X8”
open_cursors = 3000
pga_aggregate_target = 8G
diagnostic_dest = “/u01/app/oracle”
Cluster communication is configured to use the following interface(s) for this instance
192.168.10.50
192.168.10.51
cluster interconnect IPC version:Oracle RDS/IP (generic)
IPC Vendor 1 proto 3
Version 4.1
Mon Jul 20 11:21:21 2020
PMON started with pid=2, OS id=324275
Mon Jul 20 11:21:21 2020
PSP0 started with pid=3, OS id=324277
Mon Jul 20 11:21:22 2020
VKTM started with pid=4, OS id=324292 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Jul 20 11:21:22 2020
GEN0 started with pid=5, OS id=324296
Mon Jul 20 11:21:22 2020
DIAG started with pid=6, OS id=324298
Mon Jul 20 11:21:22 2020
DBRM started with pid=7, OS id=324300
Mon Jul 20 11:21:22 2020
PING started with pid=8, OS id=324302
Mon Jul 20 11:21:22 2020
ACMS started with pid=9, OS id=324304
Mon Jul 20 11:21:22 2020
DSKM started with pid=10, OS id=324306
Mon Jul 20 11:21:22 2020
DIA0 started with pid=11, OS id=324308
Mon Jul 20 11:21:22 2020
LMON started with pid=12, OS id=324310
Mon Jul 20 11:21:22 2020
LMD0 started with pid=13, OS id=324312

  • Load Monitor used for high load check
  • New Low - High Load Threshold Range = [92160 - 122880]
    Mon Jul 20 11:21:22 2020
    LMS0 started with pid=14, OS id=324315 at elevated priority
    Mon Jul 20 11:21:22 2020
    LMS1 started with pid=15, OS id=324319 at elevated priority
    Mon Jul 20 11:21:22 2020
    LMS2 started with pid=16, OS id=324323 at elevated priority
    Mon Jul 20 11:21:22 2020
    LMS3 started with pid=17, OS id=324327 at elevated priority
    Mon Jul 20 11:21:22 2020
    LMS4 started with pid=18, OS id=324331 at elevated priority
    Mon Jul 20 11:21:22 2020
    RMS0 started with pid=19, OS id=324335
    Mon Jul 20 11:21:22 2020
    LMHB started with pid=20, OS id=324337
    Mon Jul 20 11:21:22 2020
    MMAN started with pid=21, OS id=324339
    Mon Jul 20 11:21:22 2020
    DBW0 started with pid=22, OS id=324341
    Mon Jul 20 11:21:22 2020
    DBW1 started with pid=23, OS id=324343
    Mon Jul 20 11:21:22 2020
    DBW2 started with pid=24, OS id=324345
    Mon Jul 20 11:21:22 2020
    DBW3 started with pid=25, OS id=324347
    Mon Jul 20 11:21:22 2020
    DBW4 started with pid=26, OS id=324349
    Mon Jul 20 11:21:22 2020
    DBW5 started with pid=27, OS id=324351
    Mon Jul 20 11:21:22 2020
    DBW6 started with pid=28, OS id=324353
    Mon Jul 20 11:21:22 2020
    DBW7 started with pid=29, OS id=324355
    Mon Jul 20 11:21:22 2020
    DBW8 started with pid=30, OS id=324357
    Mon Jul 20 11:21:22 2020
    DBW9 started with pid=31, OS id=324359
    Mon Jul 20 11:21:22 2020
    DBWa started with pid=32, OS id=324361
    Mon Jul 20 11:21:22 2020
    DBWb started with pid=33, OS id=324363
    Mon Jul 20 11:21:22 2020
    LGWR started with pid=34, OS id=324365
    Mon Jul 20 11:21:22 2020
    CKPT started with pid=35, OS id=324367
    Mon Jul 20 11:21:22 2020
    SMON started with pid=36, OS id=324370
    Mon Jul 20 11:21:22 2020
    RECO started with pid=37, OS id=324372
    Mon Jul 20 11:21:22 2020
    RBAL started with pid=38, OS id=324374
    Mon Jul 20 11:21:22 2020
    ASMB started with pid=39, OS id=324376
    Mon Jul 20 11:21:22 2020
    MMON started with pid=40, OS id=324378
    Mon Jul 20 11:21:22 2020
    MMNL started with pid=41, OS id=324380
    starting up 1 dispatcher(s) for network address ‘(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))’…
    starting up 1 shared server(s) …
    NOTE: initiating MARK startup
    Starting background process MARK
    Mon Jul 20 11:21:22 2020
    MARK started with pid=44, OS id=324388
    NOTE: MARK has subscribed
    USER (ospid: 321859): terminating the instance due to error 29702
    Instance terminated by USER, pid = 321859

Lmon进程的trace文件如下:
GES resources 123378 pool 96
GES enqueues 170407
GES IPC: Receivers 6 Senders 6
GES IPC: Buffers Receive 1000 Send (i:6730 b:6730) Reserve 1000
GES IPC: Msg Size Regular 1176 Batch 8376
Batching factor: enqueue replay 206, ack 229
Batching factor: cache replay 114 size per lock 72
kjxggin: CGS tickets = 1000
kjxgrdmpcpu: CPU Total 96 Core 48 Socket 2 OCPU 96
kjxgrdmpcpu: High load threshold 122880
2020-07-20 11:21:22.699: [ CSSCLNT]clssgsGroupJoin: bad server response(-24)
kgxgnreg: error: status 1 (0 )
kjxgmjoin: can not join the group (DB…X8) with id 0 (inst 1)
kjxgmjoin: kgxgn error 16
2020-07-20 11:21:22.699495 : IMR recording device closed, terminating IMR
kjfmreg: Joining the cluster failed with err code 16

*** 2020-07-20 11:21:22.710
Global Enqueue Service Shutdown

解决方案

通过检查告警日志报错发现数据库无法加入到cluster中因此发生Instance terminated
通过检查Lmon进程的trace文件可以看到err code 16,通过对该error code的调查,得出导致该错误的原因是Oracle的socket文件有问题。所以需要冷关闭集群,并删除socket文件,再启动集群。操作如下:
关闭数据库
cd /u01/app/19*/grid/bin
./srvctl stop database -d db1
./srvctl stop database -d db2

关闭集群件
150-153
cd /u01/app/19*/grid/bin
./crsctl stop crs -f
删除socket文件
150-153
rm -rf /var/tmp/.oracle/*
rm -rf /tmp/.oracle/*
rm -rf /usr/tmp/.oracle/*

启动集群件
150-153
cd /u01/app/19*/grid/bin
./crsctl start crs

启动数据库
cd /u01/app/19*/grid/bin
./srvctl start database -d db1
./srvctl start database -d db2

最后再使用pfile将一个节点启动到nomount状态,成功。问题解决!撒花~~

最后修改时间:2020-07-24 10:09:38
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论