
openGauss Training Camp Learning Notes: Resolving the Standby Need repair(WAL) Problem

Original article by 姜殿斌, 恩墨学院 (Enmo Academy), 2022-05-18

Problem: in a three-node MogDB 2.1.1 cluster, both standby instances are abnormal:

[omm@node1 mogha]$ gs_om -t status --detail

[   Cluster State   ]

 cluster_state   : Degraded

redistributing  : No

current_az      : AZ_ALL

 [  Datanode State   ]

     node node_ip         port      instance                state

--------------------------------------------------------------------------------

1  node1 192.168.0.11    26000      6001 /mogdb/data/db1   P Primary Normal

2  node2 192.168.0.12    26000      6002 /mogdb/data/db1   S Standby Need repair(WAL)

3  node3 192.168.0.13    26000      6003 /mogdb/data/db1   S Standby Need repair(WAL)

[omm@node1 mogha]$

Cause:

The primary–standby connection was most likely broken by a network failure, a full disk, or a similar fault. The standby's WAL then diverged from the primary's, so the standby instance fails its consistency checks at startup and reports Need repair(WAL).

Solution:

Rebuild the data on node2 and node3.

Steps:

On the faulty standby node2, run:

[omm@node2 ~]$  gs_ctl build -D /mogdb/data/db1 -b  node2

[2022-05-18 15:20:07.426][115888][][gs_ctl]: gs_ctl incremental build ,datadir is /mogdb/data/db1

waiting for server to shut down........ done

server stopped

[2022-05-18 15:20:12.463][115888][][gs_ctl]: fopen build pid file "/mogdb/data/db1/gs_build.pid" success

[2022-05-18 15:20:12.464][115888][][gs_ctl]: fprintf build pid file "/mogdb/data/db1/gs_build.pid" success

[2022-05-18 15:20:12.465][115888][][gs_ctl]: fsync build pid file "/mogdb/data/db1/gs_build.pid" success

[2022-05-18 15:20:12.469][115888][dn_6001_6002_6003][gs_rewind]: set gaussdb state file when rewind:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(INC_BUILD).

[2022-05-18 15:20:12.517][115888][dn_6001_6002_6003][gs_rewind]: connected to server: host=192.168.0.11 port=26001 dbname=postgres application_name=gs_rewind connect_timeout=5  rw_timeout=600

[2022-05-18 15:20:12.527][115888][dn_6001_6002_6003][gs_rewind]: connect to primary success

[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: get pg_control success

[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: target server was interrupted in mode 2.

[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: sanityChecks success

[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: find last checkpoint at 0/5013C80 and checkpoint redo at 0/5013C00 from source control file

[2022-05-18 15:20:12.530][115888][dn_6001_6002_6003][gs_rewind]: find last checkpoint at 0/500B9A0 and checkpoint redo at 0/500B920 from target control file

[2022-05-18 15:20:12.532][115888][dn_6001_6002_6003][gs_rewind]: find max lsn success, find max lsn rec (0/500B9A0) success.

 [2022-05-18 15:20:12.542][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B9A0 and its crc(source, target):[1629831145, 1003012129]

[2022-05-18 15:20:12.547][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B880 and its crc(source, target):[3740980668, 319030857]

[2022-05-18 15:20:12.572][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B760 and its crc(source, target):[1584992055, 762019665]

[2022-05-18 15:20:12.579][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B640 and its crc(source, target):[2946690191, 1650212730]

[2022-05-18 15:20:12.584][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B520 and its crc(source, target):[3009108653, 2629000000]

[2022-05-18 15:20:12.588][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B400 and its crc(source, target):[1968665500, 2528249068]

[2022-05-18 15:20:12.592][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B2E0 and its crc(source, target):[962672013, 3933412142]

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: request lsn is 0/500B1C0 and its crc(source, target):[2948223938, 2948223938]

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: find common checkpoint 0/500B1C0

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: find diverge point success

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint redo (0/500B140) success before rewinding.

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: rewinding from checkpoint redo point at 0/500B140 on timeline 1

[2022-05-18 15:20:12.596][115888][dn_6001_6002_6003][gs_rewind]: diverge xlogfile is 000000010000000000000005, older ones will not be copied or removed.

[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: targetFileStatThread success pid 140717470623488.

[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: reading source file list

[2022-05-18 15:20:12.598][115888][dn_6001_6002_6003][gs_rewind]: traverse_datadir start.

[2022-05-18 15:20:12.611][115888][dn_6001_6002_6003][gs_rewind]: filemap_list_to_array start.

[2022-05-18 15:20:12.612][115888][dn_6001_6002_6003][gs_rewind]: filemap_list_to_array end sort start. length is 1378

[2022-05-18 15:20:12.612][115888][dn_6001_6002_6003][gs_rewind]: sort end.

[2022-05-18 15:20:12.615][115888][dn_6001_6002_6003][gs_rewind]: targetFileStatThread return success.

[2022-05-18 15:20:12.618][115888][dn_6001_6002_6003][gs_rewind]: reading target file list

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: traverse target datadir success

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: reading WAL in target

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: could not read WAL record at 0/500BA40: invalid record length at 0/500BA40: wanted 32, got 0

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: calculate totals rewind success

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: need to copy 56MB (total source directory size is 86MB)

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: starting background WAL receiver

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: Starting copy xlog, start point: 0/500B140

[2022-05-18 15:20:12.620][115888][dn_6001_6002_6003][gs_rewind]: in gs_rewind proecess,so no need remove.

[2022-05-18 15:20:12.624][115888][dn_6001_6002_6003][gs_rewind]: check identify system success

[2022-05-18 15:20:12.625][115888][dn_6001_6002_6003][gs_rewind]: send START_REPLICATION 0/5000000 success

[2022-05-18 15:20:12.648][115888][dn_6001_6002_6003][gs_rewind]: receiving and unpacking files...

[2022-05-18 15:20:12.974][115888][dn_6001_6002_6003][gs_rewind]: execute file map success

[2022-05-18 15:20:12.976][115888][dn_6001_6002_6003][gs_rewind]: find minRecoveryPoint success from xlog insert location 0/50181B8

[2022-05-18 15:20:12.976][115888][dn_6001_6002_6003][gs_rewind]: update pg_control file success, minRecoveryPoint: 0/50181B8, ckpLoc:0/5013C80, ckpRedo:0/5013C00, preCkp:0/5013B60

[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: update pg_dw file success

[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: xlog end point: 0/50181B8

[2022-05-18 15:20:12.978][115888][dn_6001_6002_6003][gs_rewind]: waiting for background process to finish streaming...

[2022-05-18 15:20:17.652][115888][dn_6001_6002_6003][gs_rewind]: creating backup label and updating control file

[2022-05-18 15:20:17.652][115888][dn_6001_6002_6003][gs_rewind]: create backup label success

[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint redo (0/500B140) success.

[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: read checkpoint rec (0/500B1C0) success.

[2022-05-18 15:20:17.653][115888][dn_6001_6002_6003][gs_rewind]: dn incremental build completed.

[2022-05-18 15:20:17.656][115888][dn_6001_6002_6003][gs_rewind]: fetching MOT checkpoint

[2022-05-18 15:20:17.694][115888][dn_6001_6002_6003][gs_ctl]: waiting for server to start...

.0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

 0 LOG:  [Alarm Module]Host Name: node2

 0 LOG:  [Alarm Module]Host IP: 192.168.0.12

 0 LOG:  [Alarm Module]Cluster Name: dbCluster

 The core dump path from /proc/sys/kernel/core_pattern is an invalid directory:|/usr/lib/systemd/

2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  Recovery parallelism, cpu count = 4, max = 4, actual = 4

2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4

2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

 2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host Name: node2

 2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host IP: 192.168.0.12

 2022-05-18 15:20:17.752 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Cluster Name: dbCluster

 2022-05-18 15:20:17.756 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  loaded library "security_plugin"

2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.

2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  reserved memory for backend threads is: 340 MB

2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  reserved memory for WAL buffers is: 320 MB

2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  Set max backend reserve memory is: 660 MB, max dynamic memory is: 1483 MB

2022-05-18 15:20:17.759 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  shared memory 912 Mbytes, memory context 2143 Mbytes, max process memory 3072 Mbytes

2022-05-18 15:20:17.773 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [CACHE] LOG:  set data cache  size(12582912)

2022-05-18 15:20:17.773 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [CACHE] LOG:  set metadata cache  size(4194304)

2022-05-18 15:20:17.879 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [SEGMENT_PAGE] LOG:  Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512

2022-05-18 15:20:17.915 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  mogdb: fsync file "/mogdb/data/db1/gaussdb.state.temp" success

2022-05-18 15:20:17.915 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  create gaussdb state file success: db state(STARTING_STATE), server mode(Standby)

2022-05-18 15:20:17.950 62849e31.1 [unknown] 139666705155584 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  max_safe_fds = 973, usable_fds = 1000, already_open = 17

The core dump path from /proc/sys/kernel/core_pattern is an invalid directory:|/usr/lib/systemd/

 

[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]:  done

[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: server started (/mogdb/data/db1)

[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/mogdb/data/db1/gs_build.pid" success

[2022-05-18 15:20:18.712][115888][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/mogdb/data/db1/gs_build.pid" success

[2022-05-18 15:20:18.713][115888][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/mogdb/data/db1/gs_build.pid" success

[omm@node2 ~]$

Check the cluster status from the primary node:

[omm@node1]$ gs_om -t status --detail

[   Cluster State   ]

 cluster_state   : Degraded

redistributing  : No

current_az      : AZ_ALL

[  Datanode State   ]

    node node_ip         port      instance                state

--------------------------------------------------------------------------------

1  node1 192.168.0.11    26000      6001 /mogdb/data/db1   P Primary Normal

2  node2 192.168.0.12    26000      6002 /mogdb/data/db1   S Standby Normal

3  node3 192.168.0.13    26000      6003 /mogdb/data/db1   S Standby Need repair(Disconnected)

[omm@node1 mogha]$

Note: the node2 instance is back to Normal.

In the same way, perform a rebuild on node3:

 [omm@node3 ~]$  gs_ctl build -D /mogdb/data/db1 -b  node3

...
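Since the repair on each broken standby is the same incremental build, the whole sequence can be scripted. A dry-run sketch (the hostnames, data directory, and command form are taken from the steps above; the commands are echoed rather than executed — drop the leading echo and run it as omm on each node, or dispatch via ssh, to apply it for real):

```shell
# Dry run: print the incremental-build command for each flagged standby.
# Replace "echo" with actual remote execution (e.g. ssh) to apply.
datadir=/mogdb/data/db1
for node in node2 node3; do
    echo ssh "$node" "gs_ctl build -D $datadir -b $node"
done
```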

Check the cluster status again:

[omm@node1 mogha]$ gs_om -t status --detail

[   Cluster State   ]

cluster_state   : Normal

redistributing  : No

current_az      : AZ_ALL

 [  Datanode State   ]

     node node_ip         port      instance                state

--------------------------------------------------------------------------------

1  node1 192.168.0.11    26000      6001 /mogdb/data/db1   P Primary Normal

2  node2 192.168.0.12    26000      6002 /mogdb/data/db1   S Standby Normal

3  node3 192.168.0.13    26000      6003 /mogdb/data/db1   S Standby Normal

[omm@node1 mogha]$

Note: the standby builds succeeded and the cluster is back to Normal.

End
