Problem: In a MogDB 2.1.1 three-node cluster, the standby instances are abnormal:
[omm@node1 mogha]$ gs_om -t status --detail
[ Cluster State ]
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
--------------------------------------------------------------------------------
1 node1 192.168.0.11 26000 6001 /mogdb/data/db1 P Primary Normal
2 node2 192.168.0.12 26000 6002 /mogdb/data/db1 S Standby Need repair(WAL)
3 node3 192.168.0.13 26000 6003 /mogdb/data/db1 S Standby Need repair(WAL)
[omm@node1 mogha]$
Cause:
The connection between the primary and the standby instances was most likely broken by a network failure, a full disk, or a similar fault. The primary and standby WAL then diverged, so the standby databases could not start up normally and were flagged as Need repair(WAL).
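Before rebuilding, it is worth confirming the suspected cause. A minimal set of checks, assuming the port and data directory shown in the status output above (these commands are illustrative and not part of the original session):
On the primary (node1), check which standbys are still streaming WAL:
[omm@node1 ~]$ gsql -d postgres -p 26000 -c "select client_addr, state, sync_state from pg_stat_replication;"
On each standby, confirm that the data directory still has free space:
[omm@node2 ~]$ df -h /mogdb/data/db1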
Solution:
Rebuild the data on node2 and node3 with gs_ctl build.
Steps:
On the faulty standby node2, run:
[omm@node2 ~]$ gs_ctl build -D /mogdb/data/db1 -b node2
[2022-05-18 15:20:07.426][115888][][gs_ctl]: gs_ctl incremental build ,datadir is /mogdb/data/db1
waiting for server to shut down........ done
server stopped
[2022-05-18 15:20:12.463][115888][][gs_ctl]: fopen build pid file "/mogdb/data/db1/gs_build.pid" success
[2022-05-18 15:20:12.464][115888][][gs_ctl]: fprintf build pid file "/mogdb/data/db1/gs_build.pid" success
[2022-05-18 15:20:12.465][115888][][gs_ctl]: fsync build pid file "/mogdb/data/db1/gs_build.pid" success
[2022-05-18 15:20:12.469][115888][dn_6001_6002_6003][gs_rewind]: set gaussdb state file when rewind:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(INC_BUILD).
[2022-05-18 15:20:12.517][115888][dn_6001_6002_6003][gs_rewind]: connected to server: host=192.168.0.11 port=26001 dbname=postgres application_name=gs_rewind connect_timeout=5 rw_timeout=600
[2022-05-18 15:20:12.527][115888][dn_6001_6002_6003][gs_rewind]: connect to primary success
[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: get pg_control success
[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: target server was interrupted in mode 2.
[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: sanityChecks success
[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: find last checkpoint at 0/5013C80 and checkpoint redo at 0/5013C00 from source control file
[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: find last checkpoint at 0/500B9A0 and checkpoint redo at 0/500B920 from target control file
[2022-05-18 15:20:12.532][115888][dn_6001_6002_6003][gs_rewind]: find max lsn success, find max lsn rec (0/500B9A0) success.
[2022-05-18 15:20:12.547][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B880 and its crc(source, target):[3740980668, 319030857]
[2022-05-18 15:20:12.572][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B760 and its crc(source, target):[1584992055, 762019665]
[2022-05-18 15:20:12.579][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B640 and its crc(source, target):[2946690191, 1650212730]
[2022-05-18 15:20:12.584][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B520 and its crc(source, target):[3009108653, 2629000000]
[2022-05-18 15:20:12.588][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B400 and its crc(source, target):[1968665500, 2528249068]
[2022-05-18 15:20:12.592][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B2E0 and its crc(source, target):[962672013, 3933412142]
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B1C0 and its crc(source, target):[2948223938, 2948223938]
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: find common checkpoint 0/500B1C0
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: find diverge point success
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint redo (0/500B140) success before rewinding.
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: rewinding from checkpoint redo point at 0/500B140 on timeline 1
[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: diverge xlogfile is 000000010000000000000005, older ones will not be copied or removed.
[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: targetFileStatThread success pid 140717470623488.
[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: reading source file list
[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: traverse_datadir start.
[2022-05-18 15:20:12.611][115888][dn_6001_6002_6003][gs_rewind]: filemap_list_to_array start.
[2022-05-18 15:20:12.612][115888][dn_6001_6002_6003][gs_rewind]: filemap_list_to_array end sort start. length is 1378
[2022-05-18 15:20:12.612][115888][dn_6001_6002_6003][gs_rewind]: sort end.
[2022-05-18 15:20:12.615][115888][dn_6001_6002_6003][gs_rewind]: targetFileStatThread return success.
[2022-05-18 15:20:12.618][115888][dn_6001_6002_6003][gs_rewind]: reading target file list
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: traverse target datadir success
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: reading WAL in target
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: could not read WAL record at 0/500BA40: invalid record length at 0/500BA40: wanted 32, got 0
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: calculate totals rewind success
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: need to copy 56MB (total source directory size is 86MB)
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: starting background WAL receiver
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: Starting copy xlog, start point: 0/500B140
[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: in gs_rewind proecess,so no need remove.
[2022-05-18 15:20:12.624][115888][dn_6001_6002_6003][gs_rewind]: check identify system success
[2022-05-18 15:20:12.625][115888][dn_6001_6002_6003][gs_rewind]: send START_REPLICATION 0/5000000 success
[2022-05-18 15:20:12.648][115888][dn_6001_6002_6003][gs_rewind]: receiving and unpacking files...
[2022-05-18 15:20:12.974][115888][dn_6001_6002_6003][gs_rewind]: execute file map success
[2022-05-18 15:20:12.976][115888][dn_6001_6002_6003][gs_rewind]: find minRecoveryPoint success from xlog insert location 0/50181B8
[2022-05-18 15:20:12.976][115888][dn_6001_6002_6003][gs_rewind]: update pg_control file success, minRecoveryPoint: 0/50181B8, ckpLoc:0/5013C80, ckpRedo:0/5013C00, preCkp:0/5013B60
[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: update pg_dw file success
[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: xlog end point: 0/50181B8
[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: waiting for background process to finish streaming...
[2022-05-18 15:20:17.652][115888][dn_6001_6002_6003][gs_rewind]: creating backup label and updating control file
[2022-05-18 15:20:17.652][115888][dn_6001_6002_6003][gs_rewind]: create backup label success
[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint redo (0/500B140) success.
[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint rec (0/500B1C0) success.
[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: dn incremental build completed.
[2022-05-18 15:20:17.656][115888][dn_6001_6002_6003][gs_rewind]: fetching MOT checkpoint
[2022-05-18 15:20:17.694][115888][dn_6001_6002_6003][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4
2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: reserved memory for backend threads is: 340 MB
2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: reserved memory for WAL buffers is: 320 MB
2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: Set max backend reserve memory is: 660 MB, max dynamic memory is: 1483 MB
2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: shared memory 912 Mbytes, memory context 2143 Mbytes, max process memory 3072 Mbytes
2022-05-18 15:20:17.773 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [CACHE] LOG: set data cache size(12582912)
2022-05-18 15:20:17.773 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [CACHE] LOG: set metadata cache size(4194304)
2022-05-18 15:20:17.879 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [SEGMENT_PAGE] LOG: Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512
2022-05-18 15:20:17.915 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: mogdb: fsync file "/mogdb/data/db1/gaussdb.state.temp" success
2022-05-18 15:20:17.915 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Standby)
2022-05-18 15:20:17.950 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: max_safe_fds = 973, usable_fds = 1000, already_open = 17
The core dump path from /proc/sys/kernel/core_pattern is an invalid directory:|/usr/lib/systemd/
[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: done
[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: server started (/mogdb/data/db1)
[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/mogdb/data/db1/gs_build.pid" success
[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/mogdb/data/db1/gs_build.pid" success
[2022-05-18 15:20:18.713][115888][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/mogdb/data/db1/gs_build.pid" success
[omm@node2 ~]$
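If the incremental build had failed (for example because the WAL needed by gs_rewind had already been recycled on the primary), a full build could be attempted instead; a sketch, assuming the standard gs_ctl build modes:
[omm@node2 ~]$ gs_ctl build -D /mogdb/data/db1 -b full
Here the incremental build completed successfully, so check the cluster state from node1: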
[omm@node1]$ gs_om -t status --detail
[ Cluster State ]
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
--------------------------------------------------------------------------------
1 node1 192.168.0.11 26000 6001 /mogdb/data/db1 P Primary Normal
2 node2 192.168.0.12 26000 6002 /mogdb/data/db1 S Standby Normal
3 node3 192.168.0.13 26000 6003 /mogdb/data/db1 S Standby Need repair(Disconnected)
[omm@node1 mogha]$
Note: node2 has returned to Normal.
Repeat the same rebuild on node3:
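For reference, the command on node3 is analogous to the one run on node2 above; a sketch, assuming the same data directory layout and the default (auto) build mode (output elided below):
[omm@node3 ~]$ gs_ctl build -D /mogdb/data/db1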
...
Check the cluster state again:
[omm@node1 mogha]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
--------------------------------------------------------------------------------
1 node1 192.168.0.11 26000 6001 /mogdb/data/db1 P Primary Normal
2 node2 192.168.0.12 26000 6002 /mogdb/data/db1 S Standby Normal
3 node3 192.168.0.13 26000 6003 /mogdb/data/db1 S Standby Normal
[omm@node1 mogha]$
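With both standbys back to Normal, the cluster state is Normal again. Beyond gs_om, the replication state of each instance can also be checked locally; a sketch using gs_ctl query, assuming the same data directory:
[omm@node2 ~]$ gs_ctl query -D /mogdb/data/db1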