暂无图片
暂无图片
2
暂无图片
暂无图片
暂无图片

GBase 8a 针对分片表故障自动恢复能力的验证

原创 老紫竹 2022-02-14
18960

GBase 8a是通过副本来实现高可用的。当某个节点的分片数据出现故障时,集群将自动从副本恢复该分片,本文将测试几种常见的分片故障下的GBase 8a的恢复能力,包括数据不一致,表丢失,表结构不同等。

环境

2节点集群,手工关闭集群monit工具后,通过关停服务来模拟故障。

节点

101和115两个节点

[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

版本

采用了9.5.2版本。

gbase> select version();
+-------------------+
| version()         |
+-------------------+
| 9.5.2.44.283f8db1 |
+-------------------+
1 row in set (Elapsed: 00:00:00.00)

数据

创建最简单的单列的表,并插入几行数据。为了确保数据分布到不同的分片,将id做成了分布列。

[gbase@gbase_rh7_001 ~]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> create table test1(id int)distributed by('id');
Query OK, 0 rows affected (Elapsed: 00:00:00.27)

gbase> insert into test1 values(1),(2),(3),(4);
Query OK, 4 rows affected (Elapsed: 00:00:00.17)
Records: 4  Duplicates: 0  Warnings: 0

gbase> select * from test1;
+------+
| id   |
+------+
|    2 |
|    4 |
|    1 |
|    3 |
+------+
4 rows in set (Elapsed: 00:00:00.02)

[gbase@gbase_rh7_001 ~]$ gncli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

Please modify user password

gbase> select * from test1_n1;
+------+
| id   |
+------+
|    2 |
|    4 |
+------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> select * from test1_n2;
+------+
| id   |
+------+
|    1 |
|    3 |
+------+
2 rows in set (Elapsed: 00:00:00.00)

关闭集群进程监控服务(看门狗)

如不关闭,进程会被自动拉起,影响测试中的数据采集。

[gbase@gbase_rh7_001 ~]$ gcmonit.sh stop
Stopping GCMonit success!


故障1:单纯数据不一致

该情况最常见,分片节点故障,比如离线,网络不通等,导致数据和其它备份节点不同,集群将记录dmlevent。

当故障恢复后,集群通过监控进程gc_recover,发起sync_client同步进程,将数据从副本同步到故障分片。详情可以参考本站的其它文章。

GBase 8a数据库数据不一致dmlevent故障模拟和恢复过程分析

GBase 8a存储类故障DMLStorageEvent恢复日志解析

GBase 8a数据库数据不一致ddlevent故障模拟和恢复过程分析

分片表丢失

此情况多由于环境问题导致,比如磁盘空间满,文件系统故障等,也包括人工误操作删除。此时故障节点不存在分片表了。

删掉表分片

通过gncli 登录,删掉test1_n1分片。

gbase> drop table test1_n1;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

gbase> select * from test1_n1;
ERROR 1146 (42S02): Table 'testdb.test1_n1' doesn't exist

关闭节点101, 默认系统故障

可以看到101的gnode状态为CLOSE

[gbase@gbase_rh7_001 ~]$ gcluster_services gbase stop
Stopping gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

默认后续数据操作,触发event

再次向test1表插入数据

[gbase@gbase_rh7_001 ~]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> insert into test1 values(5),(6),(7),(8);
Query OK, 4 rows affected (Elapsed: 00:00:00.15)
Records: 4  Duplicates: 0  Warnings: 0

查看event

查看集群状态gcadmin, 看到可以101的DataState为1,有不一致状态。

查看event详情,有1个dmlevent, 对应101节点的test1表的分片n1和n2, 其中n1的表已经被删掉, n2只是单纯的数据不一致。

[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 ~]$ gcadmin showdmlevent
Vc event count:1
Event ID:    2
ObjectName: testdb.test1

Fail Data Copy:
------------------------------------------------------
SegName: n1     SCN: 19459      NodeIP: 10.0.2.101      FAILURE
SegName: n2     SCN: 19459      NodeIP: 10.0.2.101      FAILURE


[gbase@gbase_rh7_001 ~]$ gcadmin showddlevent
Vc event count:0
[gbase@gbase_rh7_001 ~]$ gcadmin showdmlstorageevent
Vc event count:0
[gbase@gbase_rh7_001 ~]$ 

启动服务,模拟故障恢复

启动服务后,101节点的gnode服务状态从CLOSE变成了OPEN

[gbase@gbase_rh7_001 ~]$ gcluster_services gbase start
Starting gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 ~]$ 
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | OPEN  |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

查看同步日志

该日志gc_recover.log可能出现在任意一台调度节点的日志目录内,一般是第一台,如发现其日志长时间没有变动,则表示同步的master迁移到其它节点了。

从日志看,期间自动生成了dmlstorageevent, 并重建建了分片表。

DoDmlRecover 609 table not in local data node,but in source gcluster, convert to dmlstorage.

后面从副本分片拿到了表结构

QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"

并重建了丢失的分片表

MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)

之后调用sync_client同步了数据

 QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

完整的同步信息如下

2022-02-14 10:14:33.240 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:2 DB(testdb), TABLE(test1), SLICE(1), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 2
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:33.264 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult, row is null, or row number is not one
2022-02-14 10:14:33.264 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryTableExists null
2022-02-14 10:14:33.264 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: DoDmlRecover 609 table not in local data node,but in source gcluster, convert to dmlstorage.
2022-02-14 10:14:33.264 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 2 suffix n1
2022-02-14 10:14:33.276 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n1' online, eventinfo(eventid:2,segId:1)
2022-02-14 10:14:33.276 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS keep eventid 2
2022-02-14 10:14:33.293 [INFO ] <session:4>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:33.521 [INFO ] <session:4>: source node is 0x7302000a, table test1, suffix n2
2022-02-14 10:14:33.555 [INFO ] <session:4>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:14:33.564 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:2 DB(testdb), TABLE(test1), SLICE(2), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:14:33.573 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 3
2022-02-14 10:14:33.573 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:33.574 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:14:33.574 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.575 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19458
2022-02-14 10:14:33.575 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n2""
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n2` scn_number 19459"
2022-02-14 10:14:33.576 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n2 can not do revert with scn(19459/cur:19458)), do "revert table `testdb`.`test1_n2` scn_number 19459" error
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.577 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n2 10.0.2.101 5258 2 0'"
2022-02-14 10:14:33.728 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:14:33.728 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n2'}
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:14:33.739 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 2 suffix n2
2022-02-14 10:14:33.750 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n2' online, eventinfo(eventid:2,segId:2)
2022-02-14 10:14:33.750 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS delete eventid 3
2022-02-14 10:14:33.759 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 3
2022-02-14 10:14:33.769 [INFO ] <session:4>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:33.973 [INFO ] <session:4>: End dml recover .,tid 609
2022-02-14 10:14:33.982 [INFO ] <session:4>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 8130053026070134784
2022-02-14 10:14:33.993 [INFO ] <session:4>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 1600173590353805312
2022-02-14 10:14:33.995 [INFO ] <session:4>: Start DMLStorge recover .,tid 609 eventnum 1
2022-02-14 10:14:34.012 [INFO ] <session:4>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:34.013 [INFO ] <session:4>: MatchJustOneResult:609
2022-02-14 10:14:34.027 [INFO ] <session:4>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn(dmls): eventid=2, tablename=testdb.test1, src:0x1929510922(10.0.2.115) dst:0x1694629898(10.0.2.101)
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.036 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.036 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "use self "testdb";"
2022-02-14 10:14:34.038 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "SET SELF SCN = 19459"
2022-02-14 10:14:34.038 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"
2022-02-14 10:14:34.060 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.060 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.061 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:14:34.376 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:14:34.376 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:14:34.380 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n1'}
2022-02-14 10:14:34.380 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.381 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.381 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:14:34.382 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: Delete DMLStorge eventId 2
2022-02-14 10:14:34.396 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 2
2022-02-14 10:14:34.404 [INFO ] <session:4>: DealDMLSRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:34.817 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 1600173590353805312

表丢失时自动恢复的总结

当集群同步进程发现目标表丢失时,会自动从数据不一致的dmlevent升级到存储不一致的dmlstorageevent, 然后从副本获得表结构,重建建立分片表,然后再同步数据实现整个恢复过程。

分片表的表结构不一致

此种情况,在普通场景是不会出现的,除了非常古老版本的产品BUG,在新版本里,最常见的是磁盘满了,或者异常断电,导致分片没有完成表结构变动,同时,因为磁盘满了或异常断电,连event都没有被正确记录到物理磁盘上。

构造表结构不一致

通过在分片表,增加一个字段来模拟主副分片的表结构不一致。

gncli ......

gbase> desc test1_n1;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
1 row in set (Elapsed: 00:00:00.01)

gbase> alter table test1_n1 add column name varchar(100);\
Query OK, 5 rows affected (Elapsed: 00:00:00.07)
Records: 5  Duplicates: 0  Warnings: 0

gbase> desc test1_n1;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(11)      | YES  |     | NULL    |       |
| name  | varchar(100) | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> desc test1_n2;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
1 row in set (Elapsed: 00:00:00.00)

模拟故障

停掉服务模拟故障,变动数据生成event,启动服务模拟故障恢复。

[gbase@gbase_rh7_001 gcluster]$ gcluster_services gbase stop
Stopping gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> insert into test1 values(9),(10),(11),(12);
Query OK, 4 rows affected (Elapsed: 00:00:00.15)
Records: 4  Duplicates: 0  Warnings: 0

gbase> ^CAborted
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$
[gbase@gbase_rh7_001 gcluster]$ gcadmin showdmlevent
Vc event count:1
Event ID:    3
ObjectName: testdb.test1

Fail Data Copy:
------------------------------------------------------
SegName: n1     SCN: 19460      NodeIP: 10.0.2.101      FAILURE
SegName: n2     SCN: 19460      NodeIP: 10.0.2.101      FAILURE


[gbase@gbase_rh7_001 gcluster]$ gcadmin showddlevent
Vc event count:0
[gbase@gbase_rh7_001 gcluster]$ gcadmin showdmlstorageevent
Vc event count:0

[gbase@gbase_rh7_001 gcluster]$ gcluster_services gbase start
Starting gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 gcluster]$ 

查看同步日志

集群同步服务先调用sync_client进行数据恢复

2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

恢复数报错

2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1733)((GBA-01EX-700) Gbase general error: gc_sync_client returned error code 30.), do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'" error

报错原因是表结构不一致

2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[30],errmsg:[show full create table for sync is different.]

删掉了不一致的分片表

2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"

从副本拿到了表结构

2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"

创建分片表

2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"

调用sync_client恢复数据

2022-02-14 10:24:33.708 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

完整的恢复日志如下

2022-02-14 10:24:32.659 [INFO ] <session:6>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 1091491459250520064
2022-02-14 10:24:32.660 [INFO ] <session:6>: Start dml recover .,tid 609, eventid 3, Num 2
2022-02-14 10:24:32.669 [INFO ] <session:6>: GetSourceNodeForDMl for nodeId(1694629898), the node is inValid, uValid=1
2022-02-14 10:24:32.669 [INFO ] <session:6>: source node is 0x7302000a, table test1, suffix n1
2022-02-14 10:24:32.705 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:32.707 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:3 DB(testdb), TABLE(test1), SLICE(1), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:24:32.726 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 4
2022-02-14 10:24:32.726 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:32.728 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:24:32.728 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:32.730 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:0
2022-02-14 10:24:32.730 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n1""
2022-02-14 10:24:32.731 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n1` scn_number 19460"
2022-02-14 10:24:32.732 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n1 can not do revert with scn(19460/cur:0)), do "revert table `testdb`.`test1_n1` scn_number 19460" error
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1733)((GBA-01EX-700) Gbase general error: gc_sync_client returned error code 30.), do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'" error
2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QuerySyncCommand, Sync error
2022-02-14 10:24:32.784 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (30)
2022-02-14 10:24:32.785 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:32.785 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.786 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.786 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client return ERR_SHOW_CREATE_TABLE_IS_DIFF
2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[30],errmsg:[show full create table for sync is different.]
2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 26
2022-02-14 10:24:32.790 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 3 suffix n1
2022-02-14 10:24:32.799 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n1' online, eventinfo(eventid:3,segId:1)
2022-02-14 10:24:32.799 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS keep eventid 4
2022-02-14 10:24:32.809 [INFO ] <session:6>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:33.025 [INFO ] <session:6>: source node is 0x7302000a, table test1, suffix n2
2022-02-14 10:24:33.071 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:33.074 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:3 DB(testdb), TABLE(test1), SLICE(2), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:24:33.087 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 5
2022-02-14 10:24:33.088 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:33.089 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:24:33.089 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.090 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:24:33.090 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n2""
2022-02-14 10:24:33.091 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n2` scn_number 19460"
2022-02-14 10:24:33.092 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n2 can not do revert with scn(19460/cur:19459)), do "revert table `testdb`.`test1_n2` scn_number 19460" error
2022-02-14 10:24:33.092 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:24:33.092 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.093 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.093 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n2 10.0.2.101 5258 2 0'"
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:24:33.395 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:24:33.395 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:24:33.399 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n2'}
2022-02-14 10:24:33.399 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.400 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.400 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:24:33.401 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.401 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 3 suffix n2
2022-02-14 10:24:33.411 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n2' online, eventinfo(eventid:3,segId:2)
2022-02-14 10:24:33.411 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS delete eventid 5
2022-02-14 10:24:33.419 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 5
2022-02-14 10:24:33.430 [INFO ] <session:6>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:33.631 [INFO ] <session:6>: End dml recover .,tid 609
2022-02-14 10:24:33.633 [INFO ] <session:6>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 1091491459250520064
2022-02-14 10:24:33.636 [INFO ] <session:6>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 2113927811545694208
2022-02-14 10:24:33.636 [INFO ] <session:6>: Start DMLStorge recover .,tid 609 eventnum 1
2022-02-14 10:24:33.650 [INFO ] <session:6>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:33.651 [INFO ] <session:6>: MatchJustOneResult:609
2022-02-14 10:24:33.671 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: recoverinfo gn(dmls): eventid=4, tablename=testdb.test1, src:0x1929510922(10.0.2.115) dst:0x1694629898(10.0.2.101)
2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"
2022-02-14 10:24:33.675 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.676 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.676 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "use self "testdb";"
2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "SET SELF SCN = 19460"
2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"
2022-02-14 10:24:33.707 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.707 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.708 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:24:33.881 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:24:33.881 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:24:33.884 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n1'}
2022-02-14 10:24:33.884 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.885 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.885 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:24:33.886 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.886 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: Delete DMLStorge eventId 4
2022-02-14 10:24:33.895 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 4
2022-02-14 10:24:33.904 [INFO ] <session:6>: DealDMLSRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:34.109 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 2113927811545694208

分片表结构不一致的恢复总结

当同步进程发送同步命令返回失败时,根据错误原因判断是主副本表结构不同,则删除掉故障的表,从副本获得比结构后重建分片。最后同步数据完成整个恢复过程。

总结

本文通过几种常见的主副本不一致的情况测试,包括数据不一致,表丢失,表结构不一致等,确认GBase 8a在这些情况下都能自动完成恢复工作,确保集群后续的数据运行安全。

最后修改时间:2022-02-25 14:37:33
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论