RAC环境使用pfile将一个节点启动到nomount状态遭遇ora-29702

原创陶皓博 2020-07-24

3352

背景

最近客户这边上了一台X8一体机，需要将原先一套RAC的UAT测试数据库迁移到X8四节点集群上。数据库体量不大，约为700GB，因此采用DG的方式同步。

操作环境

X8上安装有两个数据库版本：19c(19.0.0.7)及11g(11.2.0.4)。19c作为主环境，管理着集群件。

问题

在使用备库pfile进行nomount11g数据库时发生ora-29702错误。告警日志信息及lmon进程的trace文件如下：
alert.ora

Mon Jul 20 11:21:14 2020
Adjusting the default value of parameter parallel_max_servers
from 3600 to 2970 due to the value of parameter processes (3000)
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED

Total Shared Global Region in Large Pages = 20 GB (100%)

Large Pages used by this instance: 10241 (20 GB)
Large Pages unused system wide = 13650 (27 GB)
Large Pages configured system wide = 49987 (98 GB)
Large Page size = 2048 KB

LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 96
Number of processor cores in the system is 48
Number of processor sockets in the system is 2
…
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NUMA system with 2 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/product/11.2.0.4/dbhome_1
System name: Linux
Node name: …
Release: 4.14.35-1902.301.1.el7uek.x86_64
Version: #2 SMP Tue Mar 31 16:50:32 PDT 2020
Machine: x86_64
Using parameter settings in client-side pfile /u01/app/oracle/product/11.2.0.4/dbhome_1/dbs/p.ora on machine hzsmtdbadm01.mflex.com.cn
System parameters with non-default values:
processes = 3000
sessions = 4608
sga_max_size = 20G
sga_target = 20G
control_files = “+DATAC1/…x8/controlfile/control01.ctl”
db_file_name_convert = “+DATAC1/…/DATAFILE”
db_file_name_convert = “+DATAC1/…X8/DATAFILE”
log_file_name_convert = “+DATAC1/…/ONLINELOG”
log_file_name_convert = “+DATAC1/…X8/ONLINELOG”
db_block_size = 8192
compatible = “11.2.0.4.0”
log_archive_dest_1 = “location=+DATAC1/…X8”
log_archive_dest_2 = “SERVICE=…DG ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=…DG”
log_archive_dest_3 = “SERVICE=YCWIPUAT ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=…”
log_archive_dest_state_1 = “ENABLE”
log_archive_dest_state_2 = “ENABLE”
log_archive_dest_state_3 = “ENABLE”
fal_client = “…X8”
fal_server = “…,…DG”
log_archive_config = “DG_CONFIG=(…,…DG,…X8)”
log_archive_max_processes= 30
log_buffer = 134184960
cluster_database = TRUE
db_create_file_dest = “+DATAC1”
standby_file_management = “AUTO”
thread = 1
undo_tablespace = “UNDOTBS1”
instance_number = 1
remote_login_passwordfile= “EXCLUSIVE”
audit_sys_operations = TRUE
db_domain = “”
dispatchers = “(PROTOCOL=TCP) (SERVICE=…X8XDB)”
remote_listener = “hzsmt-scan1:1521”
session_cached_cursors = 100
audit_file_dest = “/u01/app/oracle/admin/…X8/adump”
audit_trail = “DB”
db_name = “…”
db_unique_name = “…X8”
open_cursors = 3000
pga_aggregate_target = 8G
diagnostic_dest = “/u01/app/oracle”
Cluster communication is configured to use the following interface(s) for this instance
192.168.10.50
192.168.10.51
cluster interconnect IPC version:Oracle RDS/IP (generic)
IPC Vendor 1 proto 3
Version 4.1
Mon Jul 20 11:21:21 2020
PMON started with pid=2, OS id=324275
Mon Jul 20 11:21:21 2020
PSP0 started with pid=3, OS id=324277
Mon Jul 20 11:21:22 2020
VKTM started with pid=4, OS id=324292 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Mon Jul 20 11:21:22 2020
GEN0 started with pid=5, OS id=324296
Mon Jul 20 11:21:22 2020
DIAG started with pid=6, OS id=324298
Mon Jul 20 11:21:22 2020
DBRM started with pid=7, OS id=324300
Mon Jul 20 11:21:22 2020
PING started with pid=8, OS id=324302
Mon Jul 20 11:21:22 2020
ACMS started with pid=9, OS id=324304
Mon Jul 20 11:21:22 2020
DSKM started with pid=10, OS id=324306
Mon Jul 20 11:21:22 2020
DIA0 started with pid=11, OS id=324308
Mon Jul 20 11:21:22 2020
LMON started with pid=12, OS id=324310
Mon Jul 20 11:21:22 2020
LMD0 started with pid=13, OS id=324312

Load Monitor used for high load check
New Low - High Load Threshold Range = [92160 - 122880]
Mon Jul 20 11:21:22 2020
LMS0 started with pid=14, OS id=324315 at elevated priority
Mon Jul 20 11:21:22 2020
LMS1 started with pid=15, OS id=324319 at elevated priority
Mon Jul 20 11:21:22 2020
LMS2 started with pid=16, OS id=324323 at elevated priority
Mon Jul 20 11:21:22 2020
LMS3 started with pid=17, OS id=324327 at elevated priority
Mon Jul 20 11:21:22 2020
LMS4 started with pid=18, OS id=324331 at elevated priority
Mon Jul 20 11:21:22 2020
RMS0 started with pid=19, OS id=324335
Mon Jul 20 11:21:22 2020
LMHB started with pid=20, OS id=324337
Mon Jul 20 11:21:22 2020
MMAN started with pid=21, OS id=324339
Mon Jul 20 11:21:22 2020
DBW0 started with pid=22, OS id=324341
Mon Jul 20 11:21:22 2020
DBW1 started with pid=23, OS id=324343
Mon Jul 20 11:21:22 2020
DBW2 started with pid=24, OS id=324345
Mon Jul 20 11:21:22 2020
DBW3 started with pid=25, OS id=324347
Mon Jul 20 11:21:22 2020
DBW4 started with pid=26, OS id=324349
Mon Jul 20 11:21:22 2020
DBW5 started with pid=27, OS id=324351
Mon Jul 20 11:21:22 2020
DBW6 started with pid=28, OS id=324353
Mon Jul 20 11:21:22 2020
DBW7 started with pid=29, OS id=324355
Mon Jul 20 11:21:22 2020
DBW8 started with pid=30, OS id=324357
Mon Jul 20 11:21:22 2020
DBW9 started with pid=31, OS id=324359
Mon Jul 20 11:21:22 2020
DBWa started with pid=32, OS id=324361
Mon Jul 20 11:21:22 2020
DBWb started with pid=33, OS id=324363
Mon Jul 20 11:21:22 2020
LGWR started with pid=34, OS id=324365
Mon Jul 20 11:21:22 2020
CKPT started with pid=35, OS id=324367
Mon Jul 20 11:21:22 2020
SMON started with pid=36, OS id=324370
Mon Jul 20 11:21:22 2020
RECO started with pid=37, OS id=324372
Mon Jul 20 11:21:22 2020
RBAL started with pid=38, OS id=324374
Mon Jul 20 11:21:22 2020
ASMB started with pid=39, OS id=324376
Mon Jul 20 11:21:22 2020
MMON started with pid=40, OS id=324378
Mon Jul 20 11:21:22 2020
MMNL started with pid=41, OS id=324380
starting up 1 dispatcher(s) for network address ‘(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))’…
starting up 1 shared server(s) …
NOTE: initiating MARK startup
Starting background process MARK
Mon Jul 20 11:21:22 2020
MARK started with pid=44, OS id=324388
NOTE: MARK has subscribed
USER (ospid: 321859): terminating the instance due to error 29702
Instance terminated by USER, pid = 321859

Lmon进程的trace文件如下：
GES resources 123378 pool 96
GES enqueues 170407
GES IPC: Receivers 6 Senders 6
GES IPC: Buffers Receive 1000 Send (i:6730 b:6730) Reserve 1000
GES IPC: Msg Size Regular 1176 Batch 8376
Batching factor: enqueue replay 206, ack 229
Batching factor: cache replay 114 size per lock 72
kjxggin: CGS tickets = 1000
kjxgrdmpcpu: CPU Total 96 Core 48 Socket 2 OCPU 96
kjxgrdmpcpu: High load threshold 122880
2020-07-20 11:21:22.699: [ CSSCLNT]clssgsGroupJoin: bad server response(-24)
kgxgnreg: error: status 1 (0 )
kjxgmjoin: can not join the group (DB…X8) with id 0 (inst 1)
kjxgmjoin: kgxgn error 16
2020-07-20 11:21:22.699495 : IMR recording device closed, terminating IMR
kjfmreg: Joining the cluster failed with err code 16

*** 2020-07-20 11:21:22.710
Global Enqueue Service Shutdown

解决方案

通过检查告警日志报错发现数据库无法加入到cluster中因此发生Instance terminated
通过检查Lmon进程的trace文件可以看到err code 16,通过对该error code的调查，得出导致该错误的原因是Oracle的socket文件有问题。所以需要冷关闭集群，并删除socket文件，再启动集群。操作如下：
关闭数据库
cd /u01/app/19*/grid/bin
./srvctl stop database -d db1
./srvctl stop database -d db2

关闭集群件
150-153
cd /u01/app/19*/grid/bin
./crsctl stop crs -f
删除socket文件
150-153
rm -rf /var/tmp/.oracle/*
rm -rf /tmp/.oracle/*
rm -rf /usr/tmp/.oracle/*

启动集群件
150-153
cd /u01/app/19*/grid/bin
./crsctl start crs

启动数据库
cd /u01/app/19*/grid/bin
./srvctl start database -d db1
./srvctl start database -d db2

最后再使用pfile将一个节点启动到nomount状态，成功。问题解决！撒花~~

oracle mha

最后修改时间：2020-07-24 10:09:38

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

RAC环境使用pfile将一个节点启动到nomount状态遭遇ora-29702

背景

操作环境

问题

解决方案

评论