暂无图片
暂无图片
3
暂无图片
暂无图片
暂无图片

数据库管理-第244期 一次无法switchover的故障处理(20240928)

原创 胖头鱼的鱼缸 2024-09-28
788

数据库管理244期 2024-09-28

数据库管理-第244期 一次无法switchover的故障处理(20240928)

作者:胖头鱼的鱼缸(尹海文)
Oracle ACE Pro: Database(Oracle与MySQL)
PostgreSQL ACE Partner
10年数据库行业经验,现主要从事数据库服务工作
拥有OCM 11g/12c/19c、MySQL 8.0 OCP、Exadata、CDP等认证
墨天轮MVP、年度墨力之星,ITPUB认证专家、专家百人团成员,OCM讲师,PolarDB开源社区技术顾问,HaloDB外聘技术顾问,OceanBase观察团成员,青学会MOP技术社区(青年数据库学习互助会)技术顾问
圈内拥有“总监”、“保安”、“国产数据库最大敌人”等称号,非著名社恐(社交恐怖分子)
公众号:胖头鱼的鱼缸;CSDN:胖头鱼的鱼缸(尹海文);墨天轮:胖头鱼的鱼缸;ITPUB:yhw1809。
除授权转载并标明出处外,均为“非法”抄袭

演示文稿1_01.png
中秋前做了一次数据库的倒换演练,结果发现无法switchover,本期就来跟随总监一步一步的寻找并解决问题(同时感谢SR支持)。

1 问题展现

在DGMGRL中进行switchover的时候出现了下面的问题:

DGMGRL> switchover to dbdg; Performing switchover NOW, please wait... Error: ORA-16775: target standby database in broker operation has potential data loss Failed. Unable to switchover, primary database is still "dbaas"

2 问题排查与处理

2.1 问题1

第一个问题呢是在DGMGRL中show configuration:

DGMGRL> show configuration Configuration - dg Protection Mode: MaxPerformance Members: dbaas - Primary database dbdg - Physical standby database Error: ORA-16664: unable to receive the result from a member Fast-Start Failover: Disabled Configuration Status: ERROR (status updated 34 seconds ago)

但是使用show configuration verbose则是显示正常,查看主备库也是正常的:

DGMGRL> show configuration verbose Configuration - dg Protection Mode: MaxPerformance Members: dbaas - Primary database dbdg - Physical standby database Properties: FastStartFailoverThreshold = '30' OperationTimeout = '30' TraceLevel = 'SUPPORT' FastStartFailoverLagLimit = '30' CommunicationTimeout = '180' ObserverReconnect = '0' FastStartFailoverAutoReinstate = 'TRUE' FastStartFailoverPmyShutdown = 'TRUE' BystandersFollowRoleChange = 'ALL' ObserverOverride = 'FALSE' ExternalDestination1 = '' ExternalDestination2 = '' PrimaryLostWriteAction = 'CONTINUE' ConfigurationWideServiceName = 'dbaas_CFG' Fast-Start Failover: Disabled Configuration Status: SUCCESS DGMGRL> show database dbaas Database - dbaas Role: PRIMARY Intended State: TRANSPORT-ON Instance(s): dbaas1 dbaas2 Database Status: SUCCESS DGMGRL> show database dbdg Database - dbdg Role: PHYSICAL STANDBY Intended State: APPLY-ON Transport Lag: 0 seconds (computed 0 seconds ago) Apply Lag: 0 seconds (computed 0 seconds ago) Average Apply Rate: 61.30 MByte/s Real Time Query: ON Instance(s): dbdg1 dbdg2 (apply instance) dbdg3 dbdg4 Database Status: SUCCESS

最终发现是数据库本身是配置了db_domain的,而tnsname和静态监听中未配置domain(即域名后缀,如xxx.com),随机调整监听和tnsname:
监听调整(以主库实例1和listener_scan1为例):

SID_LIST_LISTENER = (SID_LIST = (SID_DESC = (GLOBAL_DBNAME = dbaas.scmcc.com) #增加domain (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1) (SID_NAME = dbaas1) ) (SID_DESC = (GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1) (SID_NAME = dbaas1) ) ) SID_LIST_LISTENER_SCAN1 = (SID_LIST = (SID_DESC = (GLOBAL_DBNAME =dbaas.scmcc.com) #增加domain (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1) (SID_NAME = dbaas1) ) (SID_DESC = (GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1) (SID_NAME = dbaas1) ) )

完成后reload所有监听。
tnsname调整:

DBAAS = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = primary-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = dbaas.xxx.com) #增加domain ) ) DBDG = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = standby-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = dbdg.xxx.com) #增加domain ) )

完成后show configuration恢复正常:

DGMGRL> show configuration Configuration - dg Protection Mode: MaxPerformance Members: dbaas - Primary database dbdg - Physical standby database Fast-Start Failover: Disabled Configuration Status: SUCCESS (status updated 59 seconds ago)

2.2 问题2

接下来尝试switchover报错依然,validate备库发现一些问题:

DGMGRL> validate database verbose dbdg Database Role: Physical standby database Primary Database: dbaas Ready for Switchover: No #<---Here Ready for Failover: Yes (Primary Running) Flashback Database Status: dbaas: Off dbdg : Off Capacity Information: Database Instances Threads dbaas 2 4 dbdg 4 4 Managed by Clusterware: dbaas: YES dbdg : YES Temporary Tablespace File Information: dbaas TEMP Files: 56 dbdg TEMP Files: 67 Data file Online Move in Progress: dbaas: No dbdg: No Standby Apply-Related Information: Apply State: Running Apply Lag: 0 seconds (computed 0 seconds ago) Apply Delay: 0 minutes Transport-Related Information: Transport On: Yes Gap Status: Gap #<---Here Transport Lag: 0 seconds (computed 0 seconds ago) Transport Status: Success Log Files Cleared: dbaas Standby Redo Log Files: Cleared dbdg Online Redo Log Files: Cleared dbdg Standby Redo Log Files: Available Current Log File Groups Configuration: Thread # Online Redo Log Groups Standby Redo Log Groups Status (dbaas) (dbdg) 2 4 7 Sufficient SRLs 1 4 7 Sufficient SRLs 3 4 7 Sufficient SRLs 4 4 7 Sufficient SRLs Future Log File Groups Configuration: Thread # Online Redo Log Groups Standby Redo Log Groups Status (dbdg) (dbaas) 2 4 7 Sufficient SRLs 1 4 7 Sufficient SRLs 3 4 7 Sufficient SRLs 4 4 7 Sufficient SRLs Current Configuration Log File Sizes: Thread # Smallest Online Redo Smallest Standby Redo Log File Size Log File Size (dbaas) (dbdg) 2 2048 MBytes 2048 MBytes 1 2048 MBytes 2048 MBytes 3 2048 MBytes 2048 MBytes 4 2048 MBytes 2048 MBytes Future Configuration Log File Sizes: Thread # Smallest Online Redo Smallest Standby Redo Log File Size Log File Size (dbdg) (dbaas) 2 2048 MBytes 2048 MBytes 1 2048 MBytes 2048 MBytes 3 2048 MBytes 2048 MBytes 4 2048 MBytes 2048 MBytes Apply-Related Property Settings: Property dbaas Value dbdg Value DelayMins 0 0 ApplyParallel AUTO AUTO ApplyInstances 0 0 Transport-Related Property Settings: Property dbaas Value dbdg Value LogShipping ON ON LogXptMode sync sync Dependency <empty> <empty> DelayMins 0 0 Binding optional optional MaxFailure 0 0 ReopenSecs 300 300 NetTimeout 30 30 RedoCompression DISABLE DISABLE

仍然显示无法switchover且存在GAP,但是通过数据库查询发现并未出现GAP(查询语句如下,结果略):

-- primary database set markup HTML on spool /tmp/primary_info.html ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS'; select thread#, max(sequence#) "Last Primary Seq Generated" from gv$archived_log val, gv$database vdb where val.resetlogs_change# = vdb.resetlogs_change# group by thread# order by 1; SELECT thread#, dest_id, gvad.status, error, fail_sequence FROM gv$archive_dest gvad, gv$instance gvi WHERE gvad.inst_id = gvi.inst_id AND destination is NOT NULL ORDER BY thread#, dest_id; select * from gv$dataguard_stats; SELECT a.thread#, b. last_seq, a.applied_seq, a. last_app_timestamp, b.last_seq-a.applied_seq ARC_DIFF FROM (SELECT thread#, MAX(sequence#) applied_seq, MAX(next_time) last_app_timestamp FROM gv$archived_log WHERE applied = 'YES' GROUP BY thread#) a, (SELECT thread#, MAX (sequence#) last_seq FROM gv$archived_log GROUP BY thread#) b WHERE a.thread# = b.thread#; spool off set markup HTML off -- standby database set markup HTML on spool /tmp/standby_info.html ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS'; select process,thread#,sequence#,status from gv$managed_standby; select * from v$dataguard_stats; select a.thread# ,a.sequence# ,a.group# grp , a.bytes/1024/1024 Size_MB ,a.status ,a.archived ,a.first_change# "First SCN Number" ,to_char(FIRST_TIME,'DD-Mon-RR HH24:MI:SS') "First SCN Time" ,to_char(LAST_TIME,'DD-Mon-RR HH24:MI:SS') "Last SCN Time" from v$standby_log a order by 1,2,3,4; select thread#, max(sequence#) "Last Standby Seq Received" from v$archived_log val, v$database vdb where val.resetlogs_change# = vdb.resetlogs_change# group by thread# order by 1; select thread#, max(sequence#) "Last Standby Seq Applied" from v$archived_log val, v$database vdb where val.resetlogs_change# = vdb.resetlogs_change# and val.applied in ('YES','IN-MEMORY') group by thread# order by 1; spool off set markup HTML off

主库检查thread:

SQL> SELECT thread#, instance, status FROM v$thread; THREAD# INSTANCE STATUS ---------- -------------------- ------ 1 dbaas1 OPEN 2 dbaas2 OPEN 3 UNNAMED_INSTANCE_3 CLOSED 4 UNNAMED_INSTANCE_4 CLOSED

这里后台建议做了一个操作:

ALTER DATABASE DISABLE THREAD 3; ALTER DATABASE DISABLE THREAD 4;

运行一段时间后再次validate备库:

DGMGRL> validate database verbose dbdg ... Ready for Switchover: Yes ... Gap Status: No Gap ...

显示可以切换且没有Gap了。目前还没有尝试再次switchover,但应该是没问题了。

3 问题分析

这里也是我第一次遇到这个问题,应该是一开始因为domain配置引起的元数据问题。关于thread的问题,因为主备节点数量不一致,但是其他类似配置的库并没有出现过相关问题,所以我怀疑还是和domain配置有问题带来的连锁反应。

4 总结

本期处理了一个ADG无法switchover的问题,源自于最早的错误配置。
老规矩,知道写了些啥。

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论