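Problem
- During a full migration from antdb to panwei, the database performed a primary/standby switchover.
Analysis
- Running gs_om -t status --detail confirmed that the cluster was currently healthy: the original primary, node 1, had been demoted to standby, and node 3 had been promoted to primary.
- The key_event log showed that the switch happened at 11-06 15:42:08 and that the action was a switchover: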
2024-11-06 15:42:08.609 tid=4050066 CTL_WORKER: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6003] [Details: send switchover message, node=3, instance=6003]
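- Checking current disk usage with df -h showed 53%.
- Reviewing the logs around the switchover time turned up the clue in the cm_server log: at that moment, disk usage on node 1 reached the 85% threshold, the instance was set to read-only, and the cluster automatically switched primaries: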
2024-11-06 15:41:52.905 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [logDisk usage] Pre Alarm threshold reached, node=1, usage=84.
2024-11-06 15:41:52.905 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [dataDisk usage] Pre Alarm threshold reached, instanceId=6001, usage=84
2024-11-06 15:42:02.908 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [logDisk usage] Pre Alarm threshold reached, node=1, usage=85.
2024-11-06 15:42:02.908 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [dataDisk usage] Pre Alarm threshold reached, instanceId=6001, usage=85
2024-11-06 15:42:02.908 tid=4050078 StorageDetect LOG: [ReadOnlyActSetDdbTo1] instance 6001 is not read only and ddb is 0, need set ddb to 1, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:04.910 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [logDisk usage] Pre Alarm threshold reached, node=1, usage=85.
2024-11-06 15:42:04.910 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [dataDisk usage] Pre Alarm threshold reached, instanceId=6001, usage=85
2024-11-06 15:42:04.910 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOn] instance 6001 is not read only and ddb is 1, set default_transaction_read_only on, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:04.983 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOn] instance 6001 set default_transaction_read_only on is success
2024-11-06 15:42:06.986 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [logDisk usage] Pre Alarm threshold reached, node=1, usage=85.
2024-11-06 15:42:06.986 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [dataDisk usage] Pre Alarm threshold reached, instanceId=6001, usage=85
2024-11-06 15:42:06.986 tid=4050078 StorageDetect LOG: [ReadOnlyActDoNoting] instance 6001 is transaction read only, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:07.634 tid=4050067 CTL_WORKER LOG: [IsReadOnlyFinalState] instanceId: 6001 is in read only final state
2024-11-06 15:42:07.634 tid=4050067 CTL_WORKER LOG: [IsReadOnlyFinalState] instanceId: 6002 is in read only final state
2024-11-06 15:42:07.634 tid=4050067 CTL_WORKER LOG: [IsReadOnlyFinalState] instanceId: 6003 is in read only final state
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary], instanceId(0: 6001), mode is 1, find the best candicate is 2, primary Idx is [static: 0:1, dynamic: 0:1, dynormal: 0:1, vaildPrim: 0, demoting: 0], isReduced is [isReduced: 0, vaildCandiCount: 0, vaildCount: 3, onlineCount:3], sameAz is [3: 0], lock msg is [lock1: 0, lock2: 0, redoFinish: [local: 1, group: 0]], arbitrateTime is [local: 1, max: 0, delay is 0], termAndLsn is [InCond:[max: (303, 53A/11CFF340), local: (303, 226/D40000C8)], noCond:[term: 303], group: 303], listStr is [curSync: [sync list is empty], expSync: [sync list is empty], voteAz: [dynamic status is empty]], cascade is [sta: [insInfo is empty], dy: [insInfo is empty]]localMsg is [dbState: 1=Normal, maxSendTime: 0, dbRestart: 0, buildReason: 0=Normal, disconn is [mode: 1=polling_connection, host: , port: 0]], azIndex is [cur: 0, master: 1, slave: 4294967295, arbiter: 4294967295] azName is jzyx, minorityAzName is (null).
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary]: the primary dn(6001) restarts count: 0 in 10 min, 0 in hour, has delay timeout(0).
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary]: line 68: current report instance is 6001, node 1, instanceId 6001, local_static_role 1=Primary, local_dynamic_role 1=Primary, local_term=303, local_last_xlog_location=226/D40000C8, local_db_state 1=Normal, local_sync_state=5, build_reason 0=Normal, double_restarting=0, disconn_mode 1=polling_connection, disconn_host=, disconn_port=0, local_host=10.14.103.49, local_port=5433, redo_finished=1, peer_state=0, sync_mode=0, current_cluster_az_status=0, validCount=3, finishRedo=0, group_term=303, curSyncList is [sync list is empty], expectSyncList is [sync list is empty], voteAzList is [dynamic status is empty], arbitrate_time is 0, sendFailoverTimes=0.
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary]: line 68: current report instance is 6001, node 2, instanceId 6002, local_static_role 2=Standby, local_dynamic_role 2=Standby, local_term=303, local_last_xlog_location=514/CAFFDF30, local_db_state 1=Normal, local_sync_state=0, build_reason 0=Normal, double_restarting=0, disconn_mode 1=polling_connection, disconn_host=, disconn_port=0, local_host=10.14.136.58, local_port=5433, redo_finished=0, peer_state=1, sync_mode=0, current_cluster_az_status=0, validCount=3, finishRedo=0, group_term=303, curSyncList is [sync list is empty], expectSyncList is [sync list is empty], voteAzList is [dynamic status is empty], arbitrate_time is 0, sendFailoverTimes=0.
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary]: line 68: current report instance is 6001, node 3, instanceId 6003, local_static_role 2=Standby, local_dynamic_role 2=Standby, local_term=303, local_last_xlog_location=53A/11CFF340, local_db_state 1=Normal, local_sync_state=0, build_reason 0=Normal, double_restarting=0, disconn_mode 1=polling_connection, disconn_host=, disconn_port=0, local_host=10.14.136.60, local_port=5433, redo_finished=0, peer_state=1, sync_mode=0, current_cluster_az_status=0, validCount=3, finishRedo=0, group_term=303, curSyncList is [sync list is empty], expectSyncList is [sync list is empty], voteAzList is [dynamic status is empty], arbitrate_time is 0, sendFailoverTimes=0.
2024-11-06 15:42:07.635 tid=4050067 CTL_WORKER LOG: [Primary], DN(6003) will automatically switchover.
2024-11-06 15:42:08.609 tid=4050066 CTL_WORKER LOG: send switchover to instance(6003) for [1/4] times.
2024-11-06 15:42:08.609 tid=4050066 CTL_WORKER LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6003] [Details: send switchover message, node=3, instance=6003]
2024-11-06 15:42:08.609 tid=4050063 IO_WORKER LOG: cmserver send msg to node 3, msgtype: MSG_CM_AGENT_SWITCHOVER
- After the switchover, disk usage kept falling; once it reached 84%, default_transaction_read_only was automatically set back to off:
2024-11-06 15:44:07.023 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [logDisk usage] Pre Alarm threshold reached, node=1, usage=84.
2024-11-06 15:44:07.023 tid=4050078 StorageDetect LOG: [PreAlarmForNodeThreshold] [dataDisk usage] Pre Alarm threshold reached, instanceId=6001, usage=84
2024-11-06 15:44:07.023 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOff] instance 6001 is read only and ddb is 1, set default_transaction_read_only off, disk_usage:84, read_only_threshold:85
2024-11-06 15:44:07.093 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOff] instance 6001 set default_transaction_read_only off is success
- The datastorage_threshold_value_check parameter was 85 on all three nodes:
$ cm_ctl list --param --server | grep 'datastorage_threshold_value_check'
datastorage_threshold_value_check = 85
datastorage_threshold_value_check = 85
datastorage_threshold_value_check = 85
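The timeline traced through the logs above can also be pulled out mechanically. The following is a minimal sketch, not a tool shipped with the database: the names `SAMPLE_LOG` and `extract_timeline` are made up for illustration, the sample lines are copied from the cm_server excerpts above, and the regular expressions assume the line layout seen in those excerpts.

```python
import re

# Sample lines copied from the cm_server log excerpts in this article.
SAMPLE_LOG = """\
2024-11-06 15:42:02.908 tid=4050078 StorageDetect LOG: [ReadOnlyActSetDdbTo1] instance 6001 is not read only and ddb is 0, need set ddb to 1, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:04.910 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOn] instance 6001 is not read only and ddb is 1, set default_transaction_read_only on, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:06.986 tid=4050078 StorageDetect LOG: [ReadOnlyActDoNoting] instance 6001 is transaction read only, disk_usage:85, read_only_threshold:85
2024-11-06 15:42:08.609 tid=4050066 CTL_WORKER LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6003] [Details: send switchover message, node=3, instance=6003]
2024-11-06 15:44:07.023 tid=4050078 StorageDetect LOG: [ReadOnlyActSetReadOnlyOff] instance 6001 is read only and ddb is 1, set default_transaction_read_only off, disk_usage:84, read_only_threshold:85
"""

# Timestamp at the start of the line, then the first bracketed tag after "LOG: ".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*? LOG: \[(?P<tag>[^\]]+)\]"
)
USAGE_RE = re.compile(r"disk_usage:(\d+), read_only_threshold:(\d+)")


def extract_timeline(log_text):
    """Return (timestamp, event, disk_usage) tuples for read-only actions and switchovers."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        tag = match.group("tag")
        if tag.startswith("ReadOnlyAct"):
            usage = USAGE_RE.search(line)
            events.append((match.group("ts"), tag, int(usage.group(1)) if usage else None))
        elif tag.startswith("KeyEvent: KEY_EVENT_SWITCHOVER"):
            events.append((match.group("ts"), "KEY_EVENT_SWITCHOVER", None))
    return events


for event in extract_timeline(SAMPLE_LOG):
    print(event)
```

Run against the full cm_server log, the same function would list the read-only actions and the switchover in time order, which makes the roughly six-second gap between hitting the threshold and the switchover easy to see.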
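About default_transaction_read_only and transaction_read_only
The openGauss documentation describes them as follows:
- default_transaction_read_only
Sets whether each newly created transaction is read-only;
a USERSET parameter.
- transaction_read_only
Sets the current transaction to read-only.
During database recovery or on a standby, this parameter is fixed to on; otherwise, it is fixed to the value of default_transaction_read_only;
a USERSET parameter.
In a healthy cluster with one primary and two standbys, default_transaction_read_only is off on every node; transaction_read_only is off on the primary and on on the standbys.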
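The relationship between the two parameters can be written down as a small function. This is only a model of the documented rule quoted above, not database code; the function name and its boolean arguments are hypothetical.

```python
def effective_transaction_read_only(in_recovery, is_standby, default_transaction_read_only):
    """Model of the documented rule: fixed to on during recovery or on a standby,
    otherwise equal to default_transaction_read_only. True means on, False means off."""
    if in_recovery or is_standby:
        return True
    return default_transaction_read_only


# Healthy primary: both parameters are off.
print(effective_transaction_read_only(False, False, False))
# Healthy standby: on, regardless of default_transaction_read_only.
print(effective_transaction_read_only(False, True, False))
# Primary after default_transaction_read_only has been set to on.
print(effective_transaction_read_only(False, False, True))
```

The third case is the one that matters for this incident: once default_transaction_read_only is on, a primary behaves as read-only even though it is neither recovering nor a standby.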
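Cause
Disk usage reached the 85% threshold set by the cm_server parameter datastorage_threshold_value_check. default_transaction_read_only and transaction_read_only were turned on, the original primary became read-only, and the cluster automatically switched primaries.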
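The sequence of StorageDetect actions seen in the log can be summarized as a small decision function. This is a sketch inferred only from the four log messages in this article, not from the cm_server source, so the real logic very likely has more branches. The function name is hypothetical; the action names, including the spelling ReadOnlyActDoNoting, and the `ddb` flag are taken from the log as-is.

```python
def storage_detect_action(disk_usage, threshold, read_only, ddb_flag):
    """Return the action name the log showed for this state, or None if no action was seen.

    Assumption: the branch order below is reconstructed from the log excerpts only.
    """
    if disk_usage >= threshold:
        if read_only:
            return "ReadOnlyActDoNoting"       # already read-only, nothing to do
        if ddb_flag == 0:
            return "ReadOnlyActSetDdbTo1"      # first step: set the ddb flag to 1
        return "ReadOnlyActSetReadOnlyOn"      # second step: default_transaction_read_only on
    if read_only:
        return "ReadOnlyActSetReadOnlyOff"     # usage fell below the threshold again
    return None


# Replay the states recorded in the log: (disk_usage, read_only, ddb_flag).
states = [(85, False, 0), (85, False, 1), (85, True, 1), (84, True, 1)]
for usage, read_only, ddb in states:
    print(usage, storage_detect_action(usage, 85, read_only, ddb))
```

Replaying the four logged states reproduces the order observed between 15:42:02 and 15:44:07: set ddb, set read-only on, do nothing, then set read-only off once usage drops to 84.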
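Resolution
The official openGauss documentation covers this problem in detail (see the reference link) and offers fixes such as freeing disk space, disabling the capacity check, and raising the capacity-check threshold.
In this environment, node 1 holds many archive files, while the new primary, node 3, has plenty of free space. The full migration can therefore be run against node 3, or against node 1 after removing files that are no longer needed; either way, first check whether the available disk capacity can hold the source database's data.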
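The capacity check before migration can be sketched as follows. This is a rough estimate under stated assumptions: the function names are hypothetical, the size of the incoming data has to be supplied by the operator, and the way cm_server computes and rounds its usage percentage may differ from this simple used/total ratio (df -h, for instance, accounts for reserved blocks differently).

```python
import shutil


def fits_under_threshold(total_bytes, used_bytes, incoming_bytes, threshold_pct=85):
    """True if usage after loading incoming_bytes stays strictly below threshold_pct."""
    # Integer arithmetic avoids floating-point rounding right at the threshold.
    return (used_bytes + incoming_bytes) * 100 < threshold_pct * total_bytes


def check_path(path, incoming_bytes, threshold_pct=85):
    """Apply the same check to the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return fits_under_threshold(usage.total, usage.used, incoming_bytes, threshold_pct)


# Example with round numbers: a disk at 53% usage receiving 30% more data stays
# under 85%, while 32% more would reach the threshold.
print(fits_under_threshold(1000, 530, 300))
print(fits_under_threshold(1000, 530, 320))
```

In practice `check_path` would be pointed at the data directory of the target node, with `incoming_bytes` set to the estimated size of the source database plus some headroom for logs and archive files generated during the migration.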
References
openGauss documentation - Nodes entering the ReadOnly state because local disk space is insufficient
Last modified: 2024-11-15 22:16:44




